History log of /openbsd-current/sys/kern/kern_synch.c
Revision Date Author Comments
# 1.205 03-Jun-2024 claudio

Remove the now unused s argument to SCHED_LOCK and SCHED_UNLOCK.

The SPL level is now tracked by the mutex and we no longer need to track
this in the callers.
OK miod@ mlarkin@ tb@ jca@


# 1.204 22-May-2024 claudio

When clearing the wait channel also clear the wait message.

There is no reason to keep the wait message in place since it will
never show up even in ddb show proc output.
OK jca@


# 1.203 20-May-2024 claudio

Rework interaction between sleep API and exit1() and start unlocking ps_threads

This diff adjusts how single_thread_set() accounts the threads by using
ps_threadcnt as initial value and counting all threads out that are already
parked. In single_thread_check(), call exit1() before decreasing ps_singlecount;
the decrement is now done in exit1().

exit1() and thread_fork() ensure that ps_threadcnt is updated with the
pr->ps_mtx held and in exit1() also account for exiting threads since
exit1() can sleep.

OK mpi@


# 1.202 18-Apr-2024 claudio

Clear PCATCH for procs that have P_WEXIT set.

Exiting procs will not return to userland and can not deliver signals so
it is better to not even try.
OK mpi@


# 1.201 30-Mar-2024 mpi

Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled.

Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside
the loop iterating over sleeping threads as part of wakeup_proc(). When such
tracepoints were enabled they could result in another wakeup(9) possibly
corrupting the sleepqueue.

Rewrite wakeup(9) in two stages, first dequeue threads from the sleepqueue then
call setrunnable() and possible tracepoints for each of them.

This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.
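
The two-stage approach described above can be sketched in userland C. This is a toy model under stated assumptions, not the committed kernel code: struct toy_thread, toy_dequeue() and toy_wakeup() are hypothetical names, and setting the runnable field stands in for setrunnable().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a sleeping thread; not the kernel's struct proc. */
struct toy_thread {
	const void *wchan;		/* channel the thread sleeps on */
	int runnable;			/* set once the thread was made runnable */
	struct toy_thread *next;	/* link in the sleep queue or wake list */
};

/*
 * Stage 1: unlink every thread sleeping on chan from *queue and collect
 * it on a private list.  Nothing else runs while the queue is modified.
 */
static struct toy_thread *
toy_dequeue(struct toy_thread **queue, const void *chan)
{
	struct toy_thread *wakeq = NULL, **tail = &wakeq;
	struct toy_thread **pp = queue, *t;

	assert(queue != NULL);
	while ((t = *pp) != NULL) {
		if (t->wchan == chan) {
			*pp = t->next;		/* unlink from the sleep queue */
			t->next = NULL;
			t->wchan = NULL;
			*tail = t;		/* append to the private list */
			tail = &t->next;
		} else
			pp = &t->next;
	}
	return wakeq;
}

/*
 * Stage 2: walk the private list.  Anything called from here, such as
 * a tracepoint that triggers another wakeup, sees a consistent queue.
 * Returns the number of threads made runnable.
 */
static int
toy_wakeup(struct toy_thread **queue, const void *chan)
{
	struct toy_thread *t, *next;
	int n = 0;

	for (t = toy_dequeue(queue, chan); t != NULL; t = next) {
		next = t->next;
		t->next = NULL;
		t->runnable = 1;	/* stands in for setrunnable() */
		n++;
	}
	return n;
}
```

Because the sleep queue is only touched in stage 1, a nested toy_wakeup() issued from stage 2 cannot observe a half-updated queue.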

ok claudio@


Revision tags: OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery
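
The three steps can be illustrated with a single-threaded toy model in C. It is only a sketch under assumptions: struct toy_proc and the toy_* functions are hypothetical, the wsleep field merely mimics the role the entry gives to P_WSLEEP, and no real scheduling takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one thread's sleep state; field names are illustrative. */
struct toy_proc {
	const void *wchan;	/* non-NULL while registered on a sleep queue */
	bool wsleep;		/* between setup and finish, like P_WSLEEP */
	bool asleep;		/* the thread would have been switched out */
};

/* Step 1: register on the sleep queue, but keep running. */
static void
toy_sleep_setup(struct toy_proc *p, const void *chan)
{
	assert(p->wchan == NULL);
	p->wchan = chan;
	p->wsleep = true;
}

/*
 * A wakeup clears the channel.  If the thread is between step 1 and
 * step 3 it is still running, so it must not be made runnable again.
 */
static void
toy_wakeup(struct toy_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	if (!p->wsleep)
		p->asleep = false;	/* really asleep: make it runnable */
}

/*
 * Step 3: sleep only if the recheck in step 2 asked for it and no
 * wakeup arrived in the meantime.  Returns true if the thread would
 * have blocked.
 */
static bool
toy_sleep_finish(struct toy_proc *p, bool do_sleep)
{
	p->wsleep = false;
	if (!do_sleep || p->wchan == NULL) {
		p->wchan = NULL;	/* cancel the registration */
		return false;
	}
	p->asleep = true;
	return true;
}
```

A caller would run toy_sleep_setup(), recheck its condition (step 2), and pass the result to toy_sleep_finish(); an event that fires between the two calls leaves the channel cleared, so the thread keeps running instead of losing the wakeup.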

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK() instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
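
The ordering requirement can be sketched with C11 atomics. This is an illustration under assumptions, using <stdatomic.h> rather than the kernel's own atomic and barrier primitives; struct toy_refcnt and the toy_refcnt_*() functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
toy_refcnt_take(struct toy_refcnt *r)
{
	unsigned int old;

	old = atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
	assert(old != 0);	/* taking a reference on a dead object */
	(void)old;
}

/* Returns true when the caller dropped the last reference. */
static bool
toy_refcnt_rele(struct toy_refcnt *r)
{
	unsigned int old;

	/*
	 * Release: the caller's earlier loads and stores to the object
	 * are ordered before the decrement.
	 */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	assert(old != 0);	/* underflow */
	if (old != 1)
		return false;

	/*
	 * Acquire: accesses made by the destructor begin only after the
	 * 1->0 transition and observe the other holders' writes.
	 */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

Only the caller that sees toy_refcnt_rele() return true may destroy the object.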

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build script comparing them, like:

    tracepoint:sched:enqueue
    {
        @ts[arg0] = nsecs;
    }

    tracepoint:sched:on__cpu
    /@ts[tid]/
    {
        latency = nsecs - @ts[tid];
    }

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
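
A minimal sketch of that clamp, assuming only the two values quoted above. TOY_INFSLP, TOY_MAXTSLP and toy_clamp_nsecs() are hypothetical names chosen to avoid clashing with the kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INFSLP	UINT64_MAX		/* means "no timeout" */
#define TOY_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still has one */

/* Limit a requested duration so that it is never read as "no timeout". */
static uint64_t
toy_clamp_nsecs(uint64_t nsecs)
{
	uint64_t clamped = nsecs < TOY_MAXTSLP ? nsecs : TOY_MAXTSLP;

	assert(clamped != TOY_INFSLP);
	return clamped;
}
```

A caller that always wants a timeout would pass toy_clamp_nsecs(nsecs) rather than nsecs itself.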


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
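
The corrected comparison can be written out in portable C without the timespeccmp() macro. This is a sketch for illustration; toy_abs_timeout_elapsed() is a hypothetical helper, and it assumes both values are normalized (0 <= tv_nsec < 1000000000).

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True when the absolute timeout *tsp is at or before *now. */
static bool
toy_abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);

	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	/* Equal seconds: "<=" makes the timeout elapse at exactly T. */
	return tsp->tv_nsec <= now->tv_nsec;
}
```

With a timeout of 1.000000000 the helper reports the timeout as elapsed once the clock reads 1.000000000, not one nanosecond later.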


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
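
One way to get "at least" semantics is sketched below. It is an assumption for illustration, not the conversion algorithm that was committed: toy_nsecs_to_ticks() and its tick_nsec parameter (the length of one clock tick in nanoseconds) are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a duration in nanoseconds to a tick count that is guaranteed
 * to cover at least that duration.
 */
static int
toy_nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Round up: a partial tick still costs a whole tick. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Saturate instead of overflowing the int tick count. */
	if (ticks >= INT_MAX)
		return INT_MAX;

	/* One extra tick, because the current tick is already partly over. */
	return (int)ticks + 1;
}
```

With a 100 Hz clock (tick_nsec of 10000000), a request for 10000001 ns becomes 3 ticks: two from rounding up and one for the tick already in progress.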

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
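
The rounding rule in this message can be sketched as a standalone conversion. This is an illustration under stated assumptions, not the code from the commit: `nsec_to_ticks()` is a hypothetical helper, the timeout is assumed to be given in nanoseconds, and `hz` is an assumed tick frequency supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a relative timeout in nanoseconds to clock ticks.
 * hz is the assumed tick frequency (ticks per second).
 * Very large nsec values could overflow; this sketch ignores that.
 */
uint64_t
nsec_to_ticks(uint64_t nsec, uint64_t hz)
{
	uint64_t tick_nsec, ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	tick_nsec = 1000000000ULL / hz;		/* length of one tick */
	/* Round up so a partial tick counts as a whole one. */
	ticks = (nsec + tick_nsec - 1) / tick_nsec;
	/* Add one for the tick that is already partly elapsed. */
	return ticks + 1;
}
```

With `hz` at 100, a 10 ms request becomes two ticks rather than one. Without the extra tick the sleep could end early, since the current tick is already partly used up when the timeout is armed.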


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.
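
The locking contract described in this message can be sketched with a userland mock. Everything here is hypothetical: `struct mock_mutex`, `mock_msleep()` and the `MOCK_PNORELOCK` value are illustrative stand-ins, and the real function also takes a wait channel, priority, wait message and timeout, which are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_PNORELOCK	0x1	/* hypothetical flag value */

/* Hypothetical mutex that only records whether it is held. */
struct mock_mutex {
	bool held;
};

/*
 * Sketch of the msleep contract: the caller must hold the mutex,
 * it is dropped for the duration of the sleep, and it is taken
 * again on return unless the no-relock flag was passed.
 */
void
mock_msleep(struct mock_mutex *mtx, int flags)
{
	/* Sleeping without holding the lock is a caller bug. */
	assert(mtx != NULL && mtx->held);

	mtx->held = false;	/* unlock just before sleeping */
	/* ... the real code would block here until wakeup() ... */
	if ((flags & MOCK_PNORELOCK) == 0)
		mtx->held = true;	/* relock after waking up */
}
```

A caller that passes no flags gets the mutex back on return. A caller that passes the no-relock flag returns with it released, which saves a lock/unlock pair when the caller is about to drop it anyway.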


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
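
The bounded wakeup described here can be sketched over a plain array. This is a hypothetical mock, not the kernel's sleep queue code: `struct mock_proc` and `mock_wakeup_n()` are illustrative names, and a NULL wait channel is assumed to mean "runnable".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process record: only the wait channel matters here. */
struct mock_proc {
	const void *wchan;	/* address slept on, NULL when runnable */
};

/* Wake at most n entries sleeping on ident; return how many were woken. */
int
mock_wakeup_n(struct mock_proc *procs, size_t nprocs, const void *ident, int n)
{
	int woken = 0;
	size_t i;

	assert(ident != NULL);
	assert(procs != NULL || nprocs == 0);
	for (i = 0; i < nprocs && woken < n; i++) {
		if (procs[i].wchan == ident) {
			procs[i].wchan = NULL;	/* mark runnable */
			woken++;
		}
	}
	return woken;
}
```

In this sketch a wakeup_one-style call corresponds to `n` equal to 1, and the loop stops as soon as the limit is reached instead of scanning the remaining sleepers.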


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
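
The clipping argument in the "nice bug" paragraph can be checked with a little arithmetic. The sketch below is not the kernel's priority code. Only the nice weight of two and the limit NICE_WEIGHT * PRIO_MAX - PPQ come from the message; the base priority of 50, the 4 priorities per queue, the unscaled use of p_estcpu, and all `sk_` names are assumptions made for illustration.

```c
#include <assert.h>

/* Assumed values; only the nice weight of two is stated in the message. */
#define SK_PUSER	50	/* assumed base user priority */
#define SK_PPQ		4	/* assumed priorities per run queue */
#define SK_NICE_WEIGHT	2	/* "nice weight of two" */
#define SK_PRIO_MAX	20	/* maximum nice value */
#define SK_ESTCPU_MAX	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the CPU usage estimate to the limit named in the message. */
unsigned int
sk_estcpu_clip(unsigned int estcpu)
{
	return estcpu > SK_ESTCPU_MAX ? SK_ESTCPU_MAX : estcpu;
}

/* User priority (larger is worse) from clipped estcpu and nice 0..20. */
unsigned int
sk_usrpri(unsigned int estcpu, int nice)
{
	assert(nice >= 0 && nice <= SK_PRIO_MAX);
	return SK_PUSER + sk_estcpu_clip(estcpu) + SK_NICE_WEIGHT * nice;
}

/* Index of the run queue a given priority falls into. */
unsigned int
sk_queue(unsigned int pri)
{
	return pri / SK_PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out at priority 86 (queue 21), while an idle nice +20 process starts at 90 (queue 22). The clipped process therefore cannot penalize itself onto the same queue as a nice +20 process, which is the property the message asks for.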


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.204 22-May-2024 claudio

When clearing the wait channel also clear the wait message.

There is no reason to keep the wait message in place since it will
never show up even in ddb show proc output.
OK jca@


# 1.203 20-May-2024 claudio

Rework interaction between sleep API and exit1() and start unlocking ps_threads

This diff adjusts how single_thread_set() accounts the threads by using
ps_threadcnt as initial value and counting all threads out that are already
parked. In single_thread_check call exit1() before decreasing ps_singlecount
this is now done in exit1().

exit1() and thread_fork() ensure that ps_threadcnt is updated with the
pr->ps_mtx held and in exit1() also account for exiting threads since
exit1() can sleep.

OK mpi@


# 1.202 18-Apr-2024 claudio

Clear PCATCH for procs that have P_WEXIT set.

Exiting procs will not return to userland and can not deliver signals so
it is better to not even try.
OK mpi@


# 1.201 30-Mar-2024 mpi

Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled.

Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside
the loop iterating over sleeping threads as part of wakeup_proc(). When such
tracepoints were enabled they could result in another wakeup(9) possibly
corrupting the sleepqueue.

Rewrite wakeup(9) in two stages, first dequeue threads from the sleepqueue then
call setrunnable() and possible tracepoints for each of them.

This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.

ok claudio@


Revision tags: OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear therefor revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracpoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK() instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
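
As an illustration only, the ordering described above can be sketched in
portable C11 atomics. This is not the kernel implementation: the
sketch_refcnt names are hypothetical, and release/acquire ordering is
assumed here as a userland stand-in for the kernel's memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a kernel reference counter. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* Start with one reference, owned by the creator. */
void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Taking a reference needs no ordering of its own. */
void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Drop a reference; returns true if this was the 1->0 transition. */
bool
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old;

	/* Release: preceding loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	assert(old != 0);	/* underflow would be a caller bug */
	if (old != 1)
		return false;

	/* Acquire: destructor accesses begin only after the transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would run the object destructor only when sketch_refcnt_rele()
returns true; the acquire fence is what lets that destructor see the
latest state written by the other reference holders.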


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
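
A minimal sketch of the two helpers, assuming a plain C11 atomic counter
in place of the kernel's struct refcnt. The sketch_ names are
hypothetical and only mirror the semantics described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a kernel reference counter. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* Returns a snapshot of the counter; it may be stale once returned. */
unsigned int
sketch_refcnt_read(struct sketch_refcnt *r)
{
	return atomic_load(&r->refs);
}

/* Returns true when someone besides the caller holds a reference. */
bool
sketch_refcnt_shared(struct sketch_refcnt *r)
{
	unsigned int refs = atomic_load(&r->refs);

	assert(refs != 0);	/* the caller itself must hold a reference */
	return refs > 1;
}
```

When sketch_refcnt_shared() returns false the caller is the only
reference holder, so (in this sketch) it may modify the object without
racing against other holders.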


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover them.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
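
The clamp can be sketched as follows. The constants are redefined
locally with the values quoted above, and clamp_sleep_nsec() is a
hypothetical helper name, not a function from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted in the commit message, redefined for this sketch. */
#define SKETCH_INFSLP	UINT64_MAX		/* "do not set a timeout" */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: turn a finite sleep request into an nsecs
 * argument that can never be mistaken for INFSLP.
 */
uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	if (nsecs > SKETCH_MAXTSLP)
		nsecs = SKETCH_MAXTSLP;
	assert(nsecs != SKETCH_INFSLP);	/* a timeout will always be set */
	return nsecs;
}
```

With this in place a request that saturates at UINT64_MAX still results
in a (very long) timed sleep instead of an indefinite one.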


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
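
The difference between the two comparisons can be shown with a small
userland sketch. The function names are hypothetical, and the
comparison is written out by hand instead of using timespeccmp(3) so
that the example builds without the BSD macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Corrected check: the deadline has elapsed once the clock reaches it. */
bool
abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	assert(deadline->tv_nsec >= 0 && deadline->tv_nsec < 1000000000L);

	/* Equivalent to a "<=" comparison of deadline against now. */
	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	return deadline->tv_nsec <= now->tv_nsec;
}

/* Old check: strictly less than, so it fires one nanosecond late. */
bool
abs_timeout_elapsed_strict(const struct timespec *deadline,
    const struct timespec *now)
{
	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	return deadline->tv_nsec < now->tv_nsec;
}
```

When the clock reads exactly the deadline, only the first function
reports the timeout as elapsed.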


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc(), a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows enforcing that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user cannot know if she needs to call the syscall again or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
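
A sketch of the idea, assuming a tick conversion of the kind the commit
describes. timespec_to_ticks() is a hypothetical helper, not the kernel
code: the point is only that the cast to uint64_t happens before the
multiplication, so the arithmetic never takes place in a signed type.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a non-negative timespec to ticks for a
 * clock running at `hz' ticks per second.
 */
uint64_t
timespec_to_ticks(const struct timespec *ts, unsigned int hz)
{
	uint64_t ticks;

	assert(ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000000);

	/*
	 * Cast first: unsigned wraparound is well defined, whereas
	 * overflowing a signed multiply is undefined behaviour.
	 */
	ticks = (uint64_t)ts->tv_sec * hz;
	ticks += (uint64_t)ts->tv_nsec * hz / 1000000000ULL;
	return ticks;
}
```

Assigning the result of a signed multiplication to an unsigned variable
would not help, because the overflow would already have happened by the
time the assignment takes place.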


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
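
The rounding rule can be sketched as below. nsec_to_ticks() is a
hypothetical helper working in nanoseconds; the kernel code of the time
may have used different units and names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to a
 * tick count for a clock running at `hz' ticks per second.
 */
uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t hz)
{
	uint64_t tick_nsec, ticks;

	/* Keeps tick_nsec >= 1000 so the additions below cannot overflow. */
	assert(hz > 0 && hz <= 1000000);
	tick_nsec = 1000000000ULL / hz;

	/* Round up so that a partial tick counts as a whole one. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Add one for the current tick, which is already partly used up. */
	return ticks + 1;
}
```

Without the extra tick, a sleep that starts late in the current tick
could return almost a full tick early; with it, the sleep lasts at
least as long as requested.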


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
  a much simpler function: cpu_switchto. Instead of having the locore
  code walk the run queues, let the MI code choose the process we
  want to run and only implement the context switching itself in MD
  code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
  in MD code, implement one idle proc for each cpu. make the idle
  loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
  wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
  steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
  platform hz values don't cause nice +0 processes to look like they are
  niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.203 20-May-2024 claudio

Rework interaction between sleep API and exit1() and start unlocking ps_threads

This diff adjusts how single_thread_set() accounts the threads by using
ps_threadcnt as initial value and counting all threads out that are already
parked. In single_thread_check call exit1() before decreasing ps_singlecount
this is now done in exit1().

exit1() and thread_fork() ensure that ps_threadcnt is updated with the
pr->ps_mtx held and in exit1() also account for exiting threads since
exit1() can sleep.

OK mpi@


# 1.202 18-Apr-2024 claudio

Clear PCATCH for procs that have P_WEXIT set.

Exiting procs will not return to userland and can not deliver signals so
it is better to not even try.
OK mpi@


# 1.201 30-Mar-2024 mpi

Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled.

Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside
the loop iterating over sleeping threads as part of wakeup_proc(). When such
tracepoints were enabled they could result in another wakeup(9) possibly
corrupting the sleepqueue.

Rewrite wakeup(9) in two stages, first dequeue threads from the sleepqueue then
call setrunnable() and possible tracepoints for each of them.

This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.

ok claudio@
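
The two-stage approach described in this entry can be illustrated with a
small userland sketch. It is a toy model, not the kernel's wakeup(9): the
sk_* names, the singly linked queue and the hook pointer are all
hypothetical. Stage 1 only unlinks matching threads from the queue; stage 2
marks them runnable and calls the optional hook, at which point the queue is
consistent again and a hook that re-enters sk_wakeup() cannot disturb the
iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sleeping thread. */
struct sk_thread {
	const void	 *wchan;	/* wait channel, NULL when not waiting */
	struct sk_thread *next;
	int		  runnable;
};

static struct sk_thread *sk_sleepq;			/* toy sleep queue */
static void (*sk_trace_hook)(struct sk_thread *);	/* optional "tracepoint" */

/* Put a thread on the toy sleep queue. */
static void
sk_sleep_on(struct sk_thread *t, const void *chan)
{
	t->wchan = chan;
	t->runnable = 0;
	t->next = sk_sleepq;
	sk_sleepq = t;
}

/* Wake every thread sleeping on chan; returns how many were woken. */
static int
sk_wakeup(const void *chan)
{
	struct sk_thread **pp = &sk_sleepq, *t, *woken = NULL;
	int n = 0;

	/* Stage 1: only unlink matching threads; no callbacks run here. */
	while ((t = *pp) != NULL) {
		if (t->wchan == chan) {
			*pp = t->next;
			t->wchan = NULL;
			t->next = woken;
			woken = t;
		} else
			pp = &t->next;
	}

	/* Stage 2: the queue is consistent again, so hooks may re-enter. */
	while ((t = woken) != NULL) {
		woken = t->next;
		t->next = NULL;
		t->runnable = 1;
		if (sk_trace_hook != NULL)
			sk_trace_hook(t);
		n++;
	}
	return n;
}
```

The private list built in stage 1 is what makes re-entry harmless in this
sketch: a nested call sees only threads that are still queued.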


Revision tags: OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
  of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish(); the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
   sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
   step 2 outcome and any possible signal delivery
To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
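
The three-step transaction in this entry can be sketched as a small,
single-threaded userland model. This is not the kernel code: the sketch_*
names, the state enum and the P_WSLEEP_SKETCH flag are hypothetical
stand-ins, and locking, timeouts and signals are left out. The point it
tries to show is that a wakeup arriving between step 1 and step 3 only
cancels the pending sleep instead of requeueing a thread that is still
running.

```c
#include <assert.h>
#include <stddef.h>

#define P_WSLEEP_SKETCH	0x1	/* stand-in for the P_WSLEEP flag */

enum sketch_stat { SK_ONPROC, SK_SLEEP, SK_RUN };

/* Hypothetical, heavily reduced stand-in for struct proc. */
struct sketch_proc {
	const void	*wchan;	/* wait channel, NULL when not waiting */
	int		 flags;
	enum sketch_stat stat;
};

/* Step 1: register on the wait channel, but keep running. */
static void
sketch_sleep_setup(struct sketch_proc *p, const void *chan)
{
	p->wchan = chan;
	p->flags |= P_WSLEEP_SKETCH;
}

/* A wakeup must not requeue a thread that is between steps 1 and 3. */
static void
sketch_wakeup(struct sketch_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	if (p->flags & P_WSLEEP_SKETCH)
		p->flags &= ~P_WSLEEP_SKETCH;	/* still on CPU: just cancel */
	else if (p->stat == SK_SLEEP)
		p->stat = SK_RUN;		/* really asleep: make runnable */
}

/*
 * Step 3: sleep only if the caller's recheck (step 2) asks for it and no
 * wakeup slipped in since step 1.  Returns 1 if the thread went to sleep.
 */
static int
sketch_sleep_finish(struct sketch_proc *p, int do_sleep)
{
	int slept = 0;

	if (do_sleep && (p->flags & P_WSLEEP_SKETCH)) {
		p->stat = SK_SLEEP;
		slept = 1;
	} else {
		p->wchan = NULL;	/* unwind the registration */
	}
	p->flags &= ~P_WSLEEP_SKETCH;
	return slept;
}
```

A caller would run sketch_sleep_setup(), recheck its own condition, and
pass the result of that recheck as do_sleep to sketch_sleep_finish().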


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
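
The ordering requirement in this entry can be sketched with C11 atomics.
This is a userland analogue under stated assumptions, not the kernel's
refcnt implementation: struct refcnt_sketch and its functions are
hypothetical names, and the kernel uses its own atomic and memory barrier
primitives rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference counter. */
struct refcnt_sketch {
	atomic_uint refs;
};

static void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

static void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	/* Taking a reference needs no ordering of its own. */
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/*
 * Drop one reference.  Returns 1 if this was the last one, meaning the
 * caller may now destroy the object.
 */
static int
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old;

	/* Release: loads and stores before this call happen before the drop. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	assert(old > 0);
	if (old != 1)
		return 0;

	/* Acquire: destructor accesses begin only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return 1;
}
```

Without the release half, a store made just before the final release could
become visible after the destructor runs; without the acquire half, the
destructor could read stale state.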


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
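
As a minimal userland sketch of the clamp described above: the two
constants carry the values quoted in this message, while the EX_ prefix
and the helper name are hypothetical and not taken from the kernel
source.

```c
#include <stdint.h>

/* Values as quoted in the commit message; the EX_ names are illustrative. */
#define EX_INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define EX_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper: limit a computed duration so that it can never
 * be mistaken for the "no timeout" sentinel.
 */
static uint64_t
ex_clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs < EX_MAXTSLP) ? nsecs : EX_MAXTSLP;
}
```

A caller that always wants a timeout would pass the clamped value on to
the sleep function, so that even a saturated duration still arms one.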


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
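
The difference can be sketched in portable C. The function below is a
hypothetical stand-in for timespeccmp(tsp, &now, <=), written out by
hand so that it builds outside the kernel; it is not the code from the
diff.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical equivalent of timespeccmp(tsp, now, <=): an absolute
 * timeout has elapsed once the clock has reached it, not only after
 * the clock has passed it.
 */
static bool
ex_abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	return tsp->tv_nsec <= now->tv_nsec;
}
```

With a timeout of 1.000000000 the function is false at 0.999999999 and
true at exactly 1.000000000, which is the behaviour argued for above.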


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc(), a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows enforcing that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1.

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to call the syscall again or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
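
A hedged illustration of the cast issue, not the code from the diff:
the function name, its parameters and the INT_MAX clamp are assumptions
made for this sketch. Casting an operand to uint64_t before multiplying
means the arithmetic itself is unsigned, so wraparound is well defined
instead of being undefined behaviour.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical conversion of whole seconds to ticks.
 * Precondition (assumed): secs >= 0 and hz > 0.
 */
static int
ex_secs_to_ticks(long long secs, int hz)
{
	/* Cast first, so the multiplication is carried out in uint64_t. */
	uint64_t ticks = (uint64_t)hz * secs;

	/* Clamp to the range of an int tick count. */
	return (ticks > (uint64_t)INT_MAX) ? INT_MAX : (int)ticks;
}
```

Unsigned wraparound is defined but can still produce a small value for
some inputs, so a real caller would presumably also bound secs before
converting.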


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.
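
The counting half of such a wrapper might look roughly like the toy
below, written with C11 atomics. Every toy_ name is hypothetical, and
the sleeping part of refcnt_finalise is deliberately left out because
it depends on the kernel sleep machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reference counter; not the kernel's struct refcnt. */
struct toy_refcnt {
	atomic_uint refs;
};

static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

static void
toy_refcnt_take(struct toy_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop one reference; returns true if it was the last one. */
static bool
toy_refcnt_rele(struct toy_refcnt *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

A finalise operation would drop the caller's own reference and then
wait until the count reaches zero, which is where the subtle sleep code
mentioned above comes in.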

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.
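
The race can be pictured with a toy model in C11 atomics. The toy_
names below are hypothetical and are not the timeout(9) implementation;
the point is only that "check, then cancel" is two steps, while a
cancel that reports what it did is one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy timeout: a single flag saying whether it is still scheduled. */
struct toy_timeout {
	atomic_bool pending;
};

static void
toy_timeout_init(struct toy_timeout *t)
{
	atomic_init(&t->pending, false);
}

static void
toy_timeout_add(struct toy_timeout *t)
{
	atomic_store(&t->pending, true);
}

/*
 * Cancel the timeout and report, in one atomic step, whether this call
 * actually removed a pending timeout. Nothing can fire or cancel it
 * between the test and the removal.
 */
static bool
toy_timeout_del(struct toy_timeout *t)
{
	return atomic_exchange(&t->pending, false);
}
```

Callers then branch on the return value of the delete itself rather
than on a separate, possibly stale, pending check.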

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world are
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
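
A minimal sketch of "round up and add one", assuming the timeout has
already been reduced to nanoseconds and that tick_nsec is the length of
one tick. The function name, its parameters and the INT_MAX clamp are
illustrative assumptions, not the committed code.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical conversion of a duration to a tick count that sleeps
 * at least as long as requested. Precondition (assumed): tick_nsec > 0.
 */
static int
ex_nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	/* Clamp early so the increments below cannot wrap. */
	if (ticks >= (uint64_t)INT_MAX)
		return INT_MAX;
	/* Round up: a partial tick still needs a whole tick. */
	if (nsecs % tick_nsec != 0)
		ticks++;
	/* Add one for the tick we are already in the middle of. */
	ticks++;
	return (ticks > (uint64_t)INT_MAX) ? INT_MAX : (int)ticks;
}
```

With 10 ms ticks, a request of exactly 10 ms becomes two ticks: one for
the duration itself and one for the partially elapsed current tick.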

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wake up.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
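
The no-op described above can be checked with a small user-space sketch. The
constant values (NICE_WEIGHT 2, PRIO_MAX 20, PPQ 4) and the helper name clip()
are assumptions chosen for illustration; they are not taken from this commit.

```c
#include <limits.h>

/* Illustrative values only (assumed, not quoted from the commit). */
#define NICE_WEIGHT	2	/* priority levels per nice level */
#define PRIO_MAX	20	/* largest nice value */
#define PPQ		4	/* priorities per run queue */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip value to at most limit. */
unsigned int
clip(unsigned int value, unsigned int limit)
{
	return value < limit ? value : limit;
}
```

For any value that fits in an unsigned char, clip(value, UCHAR_MAX) returns the
value unchanged, while clip(value, ESTCPU_LIMIT) never exceeds 36 under the
assumed constants.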

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.202 18-Apr-2024 claudio

Clear PCATCH for procs that have P_WEXIT set.

Exiting procs will not return to userland and can not deliver signals so
it is better to not even try.
OK mpi@


# 1.201 30-Mar-2024 mpi

Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled.

Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside
the loop iterating over sleeping threads as part of wakeup_proc(). When such
tracepoints were enabled they could result in another wakeup(9) possibly
corrupting the sleepqueue.

Rewrite wakeup(9) in two stages, first dequeue threads from the sleepqueue then
call setrunnable() and possible tracepoints for each of them.
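
The two-stage idea can be sketched in user space as follows. This is a toy
model under stated assumptions, not the kernel code: struct thr, the
hand-rolled list, the hook callback (standing in for a tracepoint) and
wakeup_two_stage() are all hypothetical names.

```c
#include <stddef.h>

/* Toy stand-in for a sleeping thread (hypothetical). */
struct thr {
	const void *wchan;	/* channel slept on, NULL once dequeued */
	int runnable;		/* set in stage 2 */
	struct thr *next;	/* link on the sleep queue or local list */
};

/* Wake every thread sleeping on chan; returns how many were woken. */
int
wakeup_two_stage(struct thr **sleepq, const void *chan,
    void (*hook)(struct thr *))
{
	struct thr *woken = NULL, **tail = &woken, **pp = sleepq, *t;
	int n = 0;

	/* Stage 1: unlink matching threads; no callbacks run here. */
	while ((t = *pp) != NULL) {
		if (t->wchan == chan) {
			*pp = t->next;
			t->next = NULL;
			t->wchan = NULL;
			*tail = t;
			tail = &t->next;
		} else
			pp = &t->next;
	}

	/* Stage 2: the queue is consistent again, so a hook may use it. */
	while ((t = woken) != NULL) {
		woken = t->next;
		t->next = NULL;
		t->runnable = 1;
		if (hook != NULL)
			hook(t);
		n++;
	}
	return n;
}
```

Because the hook only runs after the first loop has finished, a nested wakeup
triggered from it never sees a half-traversed queue.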

This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.

ok claudio@


Revision tags: OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
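
The ordering requirement can be illustrated with portable C11 atomics. This is
a sketch of the general release/acquire pattern, not the kernel's
implementation; struct refcnt_sketch and its functions are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space reference counter. */
struct refcnt_sketch {
	atomic_uint refs;
};

void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	/* Release: earlier loads and stores are ordered before the decrement. */
	unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
	    memory_order_release);

	if (old != 1)
		return false;

	/* Acquire: destructor work begins only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would run the object's destructor only when refcnt_sketch_rele()
returns true.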

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.
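
A minimal user-space model of the two queries, assuming a counter that starts
at one for the initial holder. The struct and function names below are
hypothetical and only mirror the semantics described in this entry.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter; refs is 1 while there is a single holder. */
struct refcnt_model {
	atomic_uint refs;
};

/* Snapshot of the counter; it may be stale by the time it is used. */
unsigned int
refcnt_model_read(struct refcnt_model *r)
{
	return atomic_load(&r->refs);
}

/* True when more than one reference exists. */
bool
refcnt_model_shared(struct refcnt_model *r)
{
	return atomic_load(&r->refs) > 1;
}
```

When refcnt_model_shared() returns false, the caller holds the only reference
in this model.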

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build script comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
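
The clamp can be sketched as below. The two constants use the values stated in
this entry; clamp_sleep_nsecs() is a hypothetical helper name, not a function
from the tree.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/* Keep a finite request from colliding with the INFSLP sentinel. */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

With the clamp in place, a computed duration that happens to equal UINT64_MAX
still results in a timeout being set.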


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
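
The boundary case can be checked in user space. TS_CMP below is a local
stand-in for the timespeccmp(3) macro so the sketch builds anywhere, and both
function names are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define TS_CMP(a, b, cmp)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec cmp (b)->tv_nsec) :			\
	    ((a)->tv_sec cmp (b)->tv_sec))

/* Fixed check: the deadline has elapsed once the clock reaches it. */
bool
timeout_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	return TS_CMP(deadline, now, <=);
}

/* Old check: misses the instant where now == deadline. */
bool
timeout_elapsed_old(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <);
}
```

The two checks differ only when the clock reads exactly the deadline: the
fixed version reports the timeout as elapsed, the old one does not.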


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
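
One way to get an "at least" guarantee is sketched below. It is an assumption
about the general approach (round up, then add one tick for the partly elapsed
current tick), not the algorithm from this commit. The function name and the
tick_nsec parameter are hypothetical, and tick_nsec must be nonzero.

```c
#include <limits.h>
#include <stdint.h>

/* Convert nanoseconds to a tick count that never undershoots. */
int
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	/* Adding up to two below would exceed INT_MAX; saturate instead. */
	if (ticks >= (uint64_t)INT_MAX - 1)
		return INT_MAX;

	if (nsecs % tick_nsec != 0)
		ticks++;	/* round a partial tick up */
	ticks++;		/* the current tick is already partly used */

	return (int)ticks;
}
```

With a 10 ms tick, a request for exactly 10 ms yields 2 ticks: one for the
interval itself and one to cover the remainder of the tick in progress.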

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
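
The conversion helpers mentioned in 1.150 might look roughly like the
following userland sketch. The names INFSLP_SKETCH and msec_to_nsec_sketch()
are hypothetical and are not taken from sys/time.h. Saturating to UINT64_MAX
on overflow is an assumption made here; since the log says UINT64_MAX means
"no timeout", an overflowing request would then be treated as an indefinite
sleep.

```c
#include <assert.h>
#include <stdint.h>

/* The log states that INFSLP is UINT64_MAX; this name is illustrative. */
#define INFSLP_SKETCH	UINT64_MAX

/*
 * Hypothetical helper: convert milliseconds to nanoseconds.
 * Assumption: a value too large to represent saturates to UINT64_MAX
 * instead of wrapping around to a short timeout.
 */
uint64_t
msec_to_nsec_sketch(uint64_t msecs)
{
	uint64_t nsecs;

	if (msecs > UINT64_MAX / 1000000ULL)
		return INFSLP_SKETCH;
	nsecs = msecs * 1000000ULL;
	assert(nsecs / 1000000ULL == msecs);	/* no wraparound happened */
	return nsecs;
}
```

A caller would pass the result as the timeout argument of one of the
*_nsec(9) functions, or pass the infinite-sleep constant directly when no
timeout is wanted.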


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
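
The point of 1.133 can be illustrated with a small userland sketch. It is
not the code from the commit: the function name and parameters are
hypothetical, and clamping to INT_MAX is an assumption about how an
oversized result should be handled. What it shows is doing every
intermediate step in uint64_t, so that no signed arithmetic can overflow
before the result is range-checked.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a relative timeout (sec, nsec) into ticks at
 * hz ticks per second, clamped to INT_MAX.
 */
int
ticks_from_timeout_sketch(int64_t sec, long nsec, int hz)
{
	uint64_t to_ticks;

	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L && hz > 0);

	/* Clamp before multiplying so the product cannot exceed 64 bits. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* Start with the unsigned 64-bit type; do not cast afterwards. */
	to_ticks = (uint64_t)hz * (uint64_t)sec +
	    ((uint64_t)nsec * (uint64_t)hz) / 1000000000ULL;
	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

Writing the same expression with signed operands and assigning the result to
an unsigned variable would not help, because the overflow would already have
happened in the signed multiplication.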


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
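
The "atomic inc/dec" part of 1.123 can be sketched in userland with C11
atomics. This is an illustration only: the struct and function names are
hypothetical, and the sleeping done by refcnt_finalise is deliberately left
out. The assertions stand in for the underflow and overflow checks that the
later revisions 1.127 and 1.128 describe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a kernel reference counter. */
struct refcnt_sketch {
	atomic_uint refs;
};

void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* object was already released */
	(void)old;
}

/* Returns true when the caller dropped the last reference. */
bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* underflow */
	return old == 1;
}
```

Whoever sees refcnt_sketch_rele() return true is the one allowed to free the
object; in the kernel version a finalising thread would instead sleep until
that point is reached.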


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
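
Why the check-then-delete pattern in 1.112 is racy can be shown with a
userland model. Everything here is hypothetical (the struct, the function
names, and the use of a single atomic flag); it only models the idea that
the cancel operation itself reports whether it cancelled anything.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a timeout with a single "pending" flag. */
struct timeout_sketch {
	atomic_bool pending;
};

void
timeout_sketch_add(struct timeout_sketch *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout and report, in one atomic step, whether it was
 * still pending. A separate "is it pending?" test followed by a delete
 * leaves a window in which the timeout can fire in between.
 */
bool
timeout_sketch_del(struct timeout_sketch *to)
{
	assert(to != NULL);
	return atomic_exchange(&to->pending, false);
}
```

Code that used to branch on the pending test can branch on the return value
of the delete instead, which is the change the commit message describes.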


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
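
The rounding rule from 1.102 can be sketched as follows. This is a minimal
userland illustration, not the kernel code: the function name, the
nanosecond input, and the bound on hz are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH	1000000000ULL

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds into
 * ticks at hz ticks per second.
 */
uint64_t
nsec_to_ticks_sketch(uint64_t nsecs, uint64_t hz)
{
	uint64_t tick_nsec, ticks;

	/* Keep hz in a range where the final "+ 1" cannot wrap. */
	assert(hz > 0 && hz <= 1000000);
	tick_nsec = NSEC_PER_SEC_SKETCH / hz;	/* length of one tick */

	/* Round up: a partial tick still costs a whole tick. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Add one for the tick that is already in progress. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could end
almost a full tick early, because the first timer interrupt arrives well
before one whole tick has elapsed.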


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
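
The behaviour described for wakeup_n in 1.31 can be modelled in userland as
below. The array-based queue, the struct, and the returned count are all
hypothetical simplifications for illustration; the kernel works on its own
sleep queue and process structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sleeper: wchan is NULL when the entry is not sleeping. */
struct sleeper_sketch {
	const void *wchan;
};

/*
 * Wake at most n entries sleeping on chan, in queue order, and return
 * how many were woken.
 */
int
wakeup_n_sketch(struct sleeper_sketch *q, size_t len, const void *chan, int n)
{
	int woken = 0;

	assert(chan != NULL);
	for (size_t i = 0; i < len && woken < n; i++) {
		if (q[i].wchan == chan) {
			q[i].wchan = NULL;	/* no longer sleeping */
			woken++;
		}
	}
	return woken;
}
```

In this model, waking a single sleeper is the n == 1 case, and a plain
wakeup of everyone on the channel corresponds to passing a count at least as
large as the queue.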


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
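
The clipping rule can be illustrated with a small sketch. This is not the
committed code: the constants PRIO_MAX = 20, PPQ = 4 and PUSER = 50 and all
helper names are assumptions chosen to match the description above (only the
nice weight of two is stated in the message).

```c
/* Assumed constants; only NICE_WEIGHT = 2 comes from the message. */
#define NICE_WEIGHT 2
#define PRIO_MAX    20
#define PPQ         4	/* priorities per run queue (assumed) */
#define PUSER       50	/* base user priority (assumed) */
#define ESTCPU_MAX  (NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the CPU usage estimate so it stays below the nice +20 range. */
static int
estcpu_clip(int estcpu)
{
	return estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu;
}

/* Hypothetical user priority: a larger value is less favoured. */
static int
user_priority(int estcpu, int nice)
{
	return PUSER + estcpu_clip(estcpu) + NICE_WEIGHT * nice;
}

/* Run queue index for a given priority. */
static int
run_queue(int pri)
{
	return pri / PPQ;
}
```

Under these assumptions a compute-bound nice +0 process stays on a better
queue than an idle nice +20 process, which is the property the clip is meant
to preserve.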

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.201 30-Mar-2024 mpi

Prevent a recursion inside wakeup(9) when scheduler tracepoints are enabled.

Tracepoints like "sched:enqueue" and "sched:unsleep" were called from inside
the loop iterating over sleeping threads as part of wakeup_proc(). When such
tracepoints were enabled they could result in another wakeup(9) possibly
corrupting the sleepqueue.

Rewrite wakeup(9) in two stages, first dequeue threads from the sleepqueue then
call setrunnable() and possible tracepoints for each of them.

This requires moving unsleep() outside of setrunnable() because it messes with
the sleepqueue.
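
A minimal userland sketch of the two-stage idea, assuming a singly linked
sleep queue. The structure and function names (struct thread, sleepq_dequeue,
wakeup_model, the hook argument standing in for a tracepoint) are hypothetical
and do not mirror the kernel's data structures.

```c
#include <stddef.h>

struct thread {
	struct thread *next;	/* link on the sleep queue or a local list */
	const void *wchan;	/* wait channel, NULL when not sleeping */
	int runnable;
};

struct sleepq {
	struct thread *head;
};

/* Stage 1: unlink every thread sleeping on chan and return them as a list. */
static struct thread *
sleepq_dequeue(struct sleepq *sq, const void *chan)
{
	struct thread **pp = &sq->head, *t;
	struct thread *woken = NULL, **tail = &woken;

	while ((t = *pp) != NULL) {
		if (t->wchan == chan) {
			*pp = t->next;		/* remove from the sleep queue */
			t->next = NULL;
			t->wchan = NULL;
			*tail = t;		/* append to the local list */
			tail = &t->next;
		} else
			pp = &t->next;
	}
	return woken;
}

/*
 * Stage 2: make each dequeued thread runnable and call the hook.
 * The sleep queue is no longer being iterated, so a hook that
 * triggers another wakeup cannot corrupt the traversal.
 */
static int
wakeup_model(struct sleepq *sq, const void *chan, void (*hook)(struct thread *))
{
	struct thread *t, *next;
	int n = 0;

	for (t = sleepq_dequeue(sq, chan); t != NULL; t = next) {
		next = t->next;
		t->next = NULL;
		t->runnable = 1;
		if (hook != NULL)
			hook(t);
		n++;
	}
	return n;
}
```

The return value counts the threads that were woken; passing NULL as the hook
models a kernel with tracepoints disabled.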

ok claudio@


Revision tags: OPENBSD_7_4_BASE OPENBSD_7_5_BASE
# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.
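
The three steps and the role of P_WSLEEP can be modelled in a few lines. This
is a single-threaded toy, not the kernel code: struct proc_model, the model_*
functions and the flag value are hypothetical, and locking is left out
entirely.

```c
#include <stddef.h>

enum pstat { SONPROC, SSLEEP, SRUN };

#define P_WSLEEP 0x1	/* between sleep setup and sleep finish (value assumed) */

struct proc_model {
	enum pstat stat;
	int flags;
	const void *wchan;
};

/* Step 1: register on the wait channel but keep running. */
static void
model_sleep_setup(struct proc_model *p, const void *chan)
{
	p->wchan = chan;
	p->flags |= P_WSLEEP;
}

/* A wakeup that may arrive at any point after step 1. */
static void
model_wakeup(struct proc_model *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	/* Still on its CPU between steps 1 and 3: do not enqueue it. */
	if ((p->flags & P_WSLEEP) == 0)
		p->stat = SRUN;
}

/*
 * Step 3: sleep only if the step 2 recheck asked for it and no
 * wakeup arrived in the meantime. Returns 1 if the proc went to sleep.
 */
static int
model_sleep_finish(struct proc_model *p, int do_sleep)
{
	p->flags &= ~P_WSLEEP;
	if (do_sleep && p->wchan != NULL) {
		p->stat = SSLEEP;
		return 1;
	}
	p->wchan = NULL;
	return 0;
}
```

A wakeup that lands between setup and finish only clears the wait channel, so
the finish step returns without sleeping and the proc never appears on the
run queue while it is still running.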

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK() instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
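
The ordering requirement can be sketched with C11 atomics. This is an analogy
only, assuming release/acquire semantics in place of the kernel's own barrier
primitives; struct refcnt_model and the refcnt_model_* names are hypothetical.

```c
#include <stdatomic.h>

struct refcnt_model {
	atomic_uint refs;
};

/* A new object starts with one reference held by its creator. */
static void
refcnt_model_init(struct refcnt_model *r)
{
	atomic_init(&r->refs, 1);
}

static void
refcnt_model_take(struct refcnt_model *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns 1 when the caller dropped the last reference. */
static int
refcnt_model_rele(struct refcnt_model *r)
{
	unsigned int old;

	/* Release: earlier loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	if (old != 1)
		return 0;
	/* Acquire: destructor accesses begin only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return 1;
}
```

The caller that sees a return value of 1 may then run the destructor and is
guaranteed to observe the writes made by every other releaser.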

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
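
A short sketch of the clamp, using the INFSLP and MAXTSLP values given above.
The helper name clamp_sleep_nsec is hypothetical; the committed change may be
structured differently.

```c
#include <stdint.h>

#define INFSLP  UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP (UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Keep a finite request from colliding with the "no timeout" sentinel. */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

After clamping, the value can never compare equal to INFSLP, so a timeout is
always armed.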


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
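
The boundary case can be checked with a small standalone sketch. ts_le() is a
local stand-in for the "<=" form of timespeccmp(3), and abs_timeout_elapsed()
is a hypothetical helper, not a function from the tree.

```c
#include <time.h>

/* Local stand-in for timespeccmp(a, b, <=). */
static int
ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute timeout has elapsed once the clock has reached it. */
static int
abs_timeout_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	return ts_le(deadline, now);
}
```

With a deadline of 1.000000000 the helper reports the timeout as elapsed at
exactly 1.000000000, not one nanosecond later.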


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
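
One way to get an "at least" guarantee is sketched below. It is not the
conversion algorithm that was committed: the helper name and the rounding
strategy are assumptions, and tick_nsec (the length of one tick in
nanoseconds) must be nonzero.

```c
#include <stdint.h>

/*
 * Convert nsecs to a tick count that cannot expire early:
 * round any partial tick up, then add one tick because the
 * current tick is already partly over.
 */
static uint64_t
nsec_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	if (nsecs % tick_nsec != 0)
		ticks++;		/* round the remainder up */
	if (ticks < UINT64_MAX)
		ticks++;		/* account for the current tick */
	return ticks;
}
```

With a 10 ms tick, a request for exactly 10 ms becomes two ticks, since the
first tick boundary may arrive almost immediately.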

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
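
The point about starting with the right type can be sketched as follows. The
helper is hypothetical and only illustrates unsigned arithmetic with an
explicit saturation check; it assumes nsec < 1000000000 and a small hz so that
nsec * hz cannot wrap.

```c
#include <stdint.h>

/*
 * Convert seconds and nanoseconds to ticks using uint64_t from the
 * start, and saturate instead of wrapping when the result is too big.
 */
static uint64_t
ticks_from(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t frac = (nsec * hz) / 1000000000ULL;

	if (hz != 0 && sec > (UINT64_MAX - frac) / hz)
		return UINT64_MAX;	/* would overflow: clamp */
	return sec * hz + frac;
}
```

Unlike signed arithmetic, the unsigned operations here have defined behaviour,
and the range check happens before the multiplication rather than after it.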


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalize that is better written once than many times.
refcnt_finalize sleeps until all references are released and does
so with sleep_setup and sleep_finish, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
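
The rounding rule in revision 1.102 can be sketched as follows. This is an
illustration only: the helper name toy_nsec_to_ticks and the choice of
nanoseconds as the unit are assumptions, not the code in kern_synch.c.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds into
 * clock ticks, where tick_nsec is the length of one tick.
 */
static uint64_t
toy_nsec_to_ticks(uint64_t nsec, uint64_t tick_nsec)
{
	/* Round up so that a partial tick counts as a whole one. */
	uint64_t ticks = (nsec + tick_nsec - 1) / tick_nsec;

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

With a 10 ms tick, a 25 ms timeout becomes 3 + 1 = 4 ticks, so the sleep
can never end earlier than requested.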


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into wall-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
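
A toy model of the "wake up to n sleepers" behaviour described in revision
1.31. The array-based queue and the toy_ names are assumptions made for
illustration; the kernel walks its own sleep queue structures.

```c
#include <stddef.h>

/* Hypothetical sleeper record: wchan is NULL once the sleeper is awake. */
struct toy_sleeper {
	const void *wchan;
};

/*
 * Wake at most n sleepers waiting on chan, in queue order.
 * Returns the number of sleepers actually woken.
 */
static int
toy_wakeup_n(struct toy_sleeper *q, size_t len, const void *chan, int n)
{
	int woken = 0;

	for (size_t i = 0; i < len && woken < n; i++) {
		if (q[i].wchan == chan) {
			q[i].wchan = NULL;	/* no longer waiting */
			woken++;
		}
	}
	return woken;
}
```

In this model a wakeup_one-style call is simply toy_wakeup_n() with n set
to 1, and a full wakeup passes a value of n at least as large as the queue.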


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
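
The clipping rule from the "nice bug" paragraph of revision 1.16 can be
sketched as below. The nice weight of two comes from the commit message;
the values used here for PRIO_MAX (20) and PPQ (4) are assumptions for the
example, and the toy_ names are hypothetical.

```c
/* Nice weight of two, as stated in the commit message. */
#define TOY_NICE_WEIGHT	2
/* Assumed values, for illustration only. */
#define TOY_PRIO_MAX	20
#define TOY_PPQ		4

/* Upper bound on p_estcpu: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define TOY_ESTCPULIM	(TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ)

/*
 * Clip the CPU usage estimate so that a compute-bound process cannot
 * penalize itself onto the same queue as a nice +20 process.
 */
static unsigned int
toy_estcpu_clip(unsigned int estcpu)
{
	return estcpu > TOY_ESTCPULIM ? TOY_ESTCPULIM : estcpu;
}
```

Under these assumed constants the limit is 36, whereas clipping an unsigned
char to UCHAR_MAX, as the old code did, never changes the value.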


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.200 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
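
The three-step transaction from revision 1.194 can be modelled in a few
lines of single-threaded C. This is a toy state machine, not kernel code:
the toy_ functions, the struct layout and the TOY_WSLEEP flag are
hypothetical stand-ins for sleep_setup(), wakeup(9), sleep_finish() and
P_WSLEEP, and no locking or real scheduling is shown.

```c
#include <stddef.h>

#define TOY_WSLEEP	0x1	/* set between setup (step 1) and finish (step 3) */

enum toy_stat { TOY_ONPROC, TOY_SLEEP };

struct toy_proc {
	const void	*wchan;	/* channel being waited on, or NULL */
	int		 flags;
	enum toy_stat	 stat;
};

/* Step 1: register on the wait channel but keep running. */
static void
toy_sleep_setup(struct toy_proc *p, const void *chan)
{
	p->wchan = chan;
	p->flags |= TOY_WSLEEP;
}

/* A wakeup may arrive at any time after step 1. */
static void
toy_wakeup(struct toy_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	/* Still between step 1 and 3: it is running, do not requeue it. */
	if (p->flags & TOY_WSLEEP)
		return;
	p->stat = TOY_ONPROC;
}

/*
 * Step 3: do_sleep carries the result of the step 2 recheck.
 * Returns 1 if the proc went to sleep, 0 if it keeps running.
 */
static int
toy_sleep_finish(struct toy_proc *p, int do_sleep)
{
	p->flags &= ~TOY_WSLEEP;
	if (do_sleep && p->wchan != NULL) {
		p->stat = TOY_SLEEP;
		return 1;
	}
	p->wchan = NULL;	/* event already fired or sleep cancelled */
	return 0;
}
```

The point of the flag is visible in toy_wakeup(): a wakeup that lands
between setup and finish only clears the wait channel, so the finish step
sees that the event already fired and does not go to sleep.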


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
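
Illustration for r1.186 (not part of the commit): a minimal user-space
sketch of the ordering described above, written with C11 atomics. The
sketch_ names are hypothetical and this is not the kernel implementation;
it only shows release ordering on the decrement and an acquire fence after
the 1->0 transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's struct refcnt. */
struct sketch_refcnt {
	atomic_uint r_refs;
};

void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	/* A new object starts with one reference, held by its creator. */
	atomic_init(&r->r_refs, 1);
}

void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	/* Taking a reference needs no ordering of its own in this sketch. */
	atomic_fetch_add_explicit(&r->r_refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
bool
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old;

	/* Release: preceding loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->r_refs, 1, memory_order_release);
	assert(old != 0);	/* underflow would be a caller bug */
	if (old != 1)
		return false;

	/* Acquire: destructor work begins only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would free the object only when sketch_refcnt_rele() returns true.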


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
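
Illustration for r1.184 (not part of the commit): a small user-space sketch
of the two queries, assuming a counter that starts at one for the creating
holder. The sketch_ names are hypothetical; only the semantics stated above
(shared means more than one reference, read returns a snapshot) are taken
from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's struct refcnt. */
struct sketch_refcnt {
	atomic_uint r_refs;
};

void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->r_refs, 1);
}

void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old;

	old = atomic_fetch_add(&r->r_refs, 1);
	assert(old != 0);	/* taking a reference on a dead object is a bug */
}

/* True when somebody besides the caller holds a reference. */
bool
sketch_refcnt_shared(struct sketch_refcnt *r)
{
	return atomic_load(&r->r_refs) > 1;
}

/* Snapshot of the counter; it may be stale as soon as it is returned. */
unsigned int
sketch_refcnt_read(struct sketch_refcnt *r)
{
	return atomic_load(&r->r_refs);
}
```

Because the read is only a snapshot, it suits diagnostics better than
decisions that must stay true after the call returns.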


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover such
sleeps. msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Other examples are dt(4)'s scheduler TRACEPOINT() calls in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
	@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
	latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant and racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
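
Illustration for r1.166 (not part of the commit): a minimal sketch of the
clamp in plain user-space C. The SKETCH_ macros are hypothetical stand-ins
whose values mirror the INFSLP and MAXTSLP values quoted above, and the
helper name is invented for this example.

```c
#include <stdint.h>

#define SKETCH_INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Limit a caller-supplied duration so that it can never be mistaken
 * for the "no timeout" sentinel.
 */
uint64_t
sketch_clamp_sleep_nsecs(uint64_t nsecs)
{
	return nsecs > SKETCH_MAXTSLP ? SKETCH_MAXTSLP : nsecs;
}
```

Only the single value UINT64_MAX is changed by the clamp; every other
duration passes through untouched.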


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
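
Illustration for r1.165 (not part of the commit): a self-contained sketch
of the corrected comparison. It avoids the timespeccmp(3) macro so that it
builds anywhere; the sketch_ helper names are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* True when *a is less than or equal to *b. */
bool
sketch_timespec_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/*
 * An absolute timeout has elapsed once the clock has reached it,
 * so equality counts as elapsed.
 */
bool
sketch_abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return sketch_timespec_le(deadline, now);
}
```

With a strict "<" in place of "<=", a deadline of exactly 1.000000000 would
not count as elapsed when the clock reads 1.000000000.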


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are now
always < PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
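
Illustration for r1.150 (not part of the commit): a sketch of what an
inline conversion helper of this kind could look like. The name is
hypothetical, and saturating at UINT64_MAX on overflow is an assumption of
this sketch, not something stated in the commit message.

```c
#include <stdint.h>

#define SKETCH_INFSLP	UINT64_MAX	/* mirrors the INFSLP value quoted above */

/*
 * Convert milliseconds to nanoseconds without overflowing.
 * Assumption: values too large to represent saturate at UINT64_MAX,
 * which the sleep functions would then read as "no timeout".
 */
uint64_t
sketch_msec_to_nsec(uint64_t msecs)
{
	if (msecs > UINT64_MAX / 1000000ULL)
		return UINT64_MAX;
	return msecs * 1000000ULL;
}
```

Note that zero stays zero: in this scheme "sleep forever" is spelled with
the sentinel, not with a zero timeout.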


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to call the syscall again or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
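
Illustration for r1.133 (not part of the commit): a user-space sketch of
why the arithmetic should start in uint64_t. Signed overflow is undefined
in C, while unsigned arithmetic wraps in a defined way and can be checked.
The function name, the wrap check and the clamp to INT_MAX are assumptions
made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Convert a non-negative number of seconds to ticks, clamped to INT_MAX. */
int
sketch_secs_to_ticks(int64_t secs, int hz)
{
	uint64_t to_ticks;

	assert(secs >= 0 && hz > 0);

	/* Widen first: the multiplication itself is done in uint64_t. */
	to_ticks = (uint64_t)hz * (uint64_t)secs;

	/* If the product wrapped, dividing back does not yield secs. */
	if (to_ticks / (uint64_t)hz != (uint64_t)secs || to_ticks > INT_MAX)
		return INT_MAX;
	return (int)to_ticks;
}
```

Casting only the result, as in (uint64_t)(hz * secs), would be too late:
the overflow would already have happened in the signed type.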


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalize that is better written once than many times.
refcnt_finalize sleeps until all references are released and does
so with sleep_setup and sleep_finish, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
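
Illustration for r1.112 (not part of the commit): a toy user-space model of
the race. In the check-then-delete pattern the timeout can fire between the
two calls; a delete that reports what it did makes the test and the action
one step. The sketch_ types and functions are hypothetical and model only
a single "pending" bit, not the kernel's timeout wheel.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy timeout: only tracks whether it is currently scheduled. */
struct sketch_timeout {
	atomic_bool pending;
};

void
sketch_timeout_init(struct sketch_timeout *to)
{
	atomic_init(&to->pending, false);
}

void
sketch_timeout_add(struct sketch_timeout *to)
{
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout and report whether this call is the one that
 * cancelled it. The exchange tests and clears in a single atomic step,
 * so no window exists between the check and the delete.
 */
bool
sketch_timeout_del(struct sketch_timeout *to)
{
	return atomic_exchange(&to->pending, false);
}
```

Callers then branch on the return value of the delete instead of asking
whether the timeout is pending first.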


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
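
Illustration for r1.102 (not part of the commit): a sketch of the
conversion rule in user-space C. The function name and the tick length
passed by the caller are hypothetical; only the "round up, then add one"
rule comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a duration to a tick count that is guaranteed to cover it.
 * Round a partial tick up, then add one more tick because the tick we
 * are currently in is already partly used up.
 */
uint64_t
sketch_nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	assert(tick_nsec > 0);
	return nsecs / tick_nsec + (nsecs % tick_nsec != 0) + 1;
}
```

For example, with a 10 ms tick a 25 ms request becomes 4 ticks: at least 3
full ticks elapse before the wakeup, which covers the 25 ms even if the
current tick is almost over.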


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
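
The clipping bound from the "nice bug" paragraph can be illustrated with a small sketch. The constants (NICE_WEIGHT = 2, PRIO_MAX = 20, PPQ = 4) and all helper names below are assumptions picked to match the prose; they are not quoted from the commit or the tree.

```c
#include <assert.h>

/* Assumed constants, chosen to match the commit text; not taken from the tree. */
#define NICE_WEIGHT	2	/* priority points per nice level */
#define PRIO_MAX	20	/* largest nice value */
#define PPQ		4	/* priorities per run queue */

/* Upper bound on p_estcpu as stated in the text. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip the decayed CPU estimate to the bound. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return (estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu);
}

/* Hypothetical helper: priority offset from CPU usage plus nice weight. */
static unsigned int
usrpri_offset(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= PRIO_MAX);
	return (estcpu_clip(estcpu) + NICE_WEIGHT * nice);
}

/* Hypothetical helper: index of the queue a priority offset falls into. */
static unsigned int
usrpri_queue(unsigned int offset)
{
	return (offset / PPQ);
}
```

Under these assumed values a nice +0 process with a saturated estimate gets offset 36 (queue 9), while an idle nice +20 process gets offset 40 (queue 10), so the compute-bound process can no longer penalize itself onto the niced queue.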


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.199 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish(); the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.
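
The three steps and the role of P_WSLEEP can be sketched as a single-threaded userland model. This is only a toy under stated assumptions: the struct, the model_* functions, and the flag value are hypothetical, and none of the locking or run queue handling of the real kernel is represented.

```c
#include <assert.h>
#include <stddef.h>

#define P_WSLEEP	0x1	/* model only: between setup and finish */

enum model_stat { SONPROC, SSLEEP, SRUN };

/* Hypothetical stand-in for the few struct proc fields involved. */
struct model_proc {
	const void	*p_wchan;	/* wait channel, NULL if not queued */
	int		 p_flag;
	enum model_stat	 p_stat;
};

/* Step 1: register on the (modelled) sleep queue while still running. */
static void
model_sleep_setup(struct model_proc *p, const void *chan)
{
	assert(p->p_wchan == NULL);
	p->p_wchan = chan;
	p->p_flag |= P_WSLEEP;
}

/*
 * Wakeup: always dequeue, but only make the proc runnable if it is
 * really asleep. A proc still between step 1 and step 3 is running.
 */
static void
model_wakeup(struct model_proc *p, const void *chan)
{
	if (p->p_wchan != chan)
		return;
	p->p_wchan = NULL;
	if ((p->p_flag & P_WSLEEP) == 0)
		p->p_stat = SRUN;
}

/*
 * Step 3: sleep only if the caller's recheck (step 2) says so and no
 * wakeup arrived in the meantime. Returns 1 if the proc went to sleep.
 */
static int
model_sleep_finish(struct model_proc *p, int do_sleep)
{
	int slept = 0;

	if (do_sleep && p->p_wchan != NULL) {
		p->p_stat = SSLEEP;
		slept = 1;
	} else
		p->p_wchan = NULL;	/* leave the sleep queue */
	p->p_flag &= ~P_WSLEEP;
	return (slept);
}
```

In this model a wakeup that lands between setup and finish just clears p_wchan, and model_sleep_finish() then keeps the proc running instead of putting it to sleep.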

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
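
The ordering requirement can be sketched with portable C11 atomics. This is an illustration only, not the kernel code: the model_* names are hypothetical and C11 memory orders are swapped in for whatever barrier primitives the kernel actually uses.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference counter. */
struct model_refcnt {
	atomic_uint	r_refs;
};

static void
model_refcnt_init(struct model_refcnt *r)
{
	atomic_init(&r->r_refs, 1);
}

static void
model_refcnt_take(struct model_refcnt *r)
{
	atomic_fetch_add_explicit(&r->r_refs, 1, memory_order_relaxed);
}

/* Drop one reference; returns nonzero if it was the last one. */
static int
model_refcnt_rele(struct model_refcnt *r)
{
	unsigned int old;

	/* Release: earlier loads and stores complete before the decrement. */
	old = atomic_fetch_sub_explicit(&r->r_refs, 1, memory_order_release);
	assert(old != 0);
	if (old != 1)
		return (0);

	/* Acquire: destructor work starts only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return (1);
}
```

The caller that sees a nonzero return is the one that may run the destructor, and the fence is what lets it observe the final state written by the other reference holders.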

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant and racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
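
The clamp described above can be sketched as follows. This is a userland
illustration only: the two constants use the values quoted in the message,
and clamp_sleep_nsec() is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest duration that still sets one */

/*
 * Hypothetical helper: make sure a caller-supplied duration can never
 * be mistaken for the "no timeout" sentinel.
 */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

Only the sentinel value itself is changed; every shorter duration passes
through untouched.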


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
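
A minimal userland sketch of the two comparisons. TS_CMP is a local
stand-in written for this example in the style of timespeccmp(3), and both
function names are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* Fixed check: the deadline has elapsed once the clock has reached it. */
static bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	return TS_CMP(deadline, now, <=);
}

/* Old check, kept only to show the one-nanosecond discrepancy. */
static bool
deadline_elapsed_strict(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <);
}
```

With a deadline of 1.000000000 and a clock reading of exactly 1.000000000,
the first function reports the timeout as elapsed and the second does not.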


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
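
The error selection described above, reduced to a sketch. The function name
and its boolean parameter are hypothetical; how the kernel actually learns
whether SA_RESTART applies is not shown in this log.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the error reported for a sleep that was
 * interrupted by a signal.
 */
static int
interrupted_sleep_error(bool sa_restart)
{
	return sa_restart ? ECANCELED : EINTR;
}
```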


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
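
A sketch of the point about starting with the right type. The function name
and the exact formula are assumptions made for illustration, not the code
that was committed; what matters is that the cast happens before the
multiplication, so the arithmetic is unsigned from the start.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical conversion of a timespec to clock ticks at "hz" ticks per
 * second. Casting each operand to uint64_t first means an overflow wraps
 * in a defined way instead of being signed-overflow undefined behaviour.
 */
static uint64_t
timespec_to_ticks(const struct timespec *ts, int hz)
{
	uint64_t ticks;

	assert(hz > 0);
	ticks = (uint64_t)ts->tv_sec * (uint64_t)hz;
	ticks += (uint64_t)ts->tv_nsec * (uint64_t)hz / 1000000000ULL;
	return ticks;
}
```

Assigning the result of a signed multiplication to an unsigned variable
would not help, because the overflow has already happened by then.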


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
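
a rough userland sketch of the "atomic inc/dec" half, using C11
<stdatomic.h> rather than the kernel's atomic operations. the ex_ names are
hypothetical, and the sleeping part of refcnt_finalise is deliberately not
reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ex_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference, held by its creator. */
static void
ex_refcnt_init(struct ex_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
ex_refcnt_take(struct ex_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* taking a reference on a dead object */
	(void)old;
}

/* Returns true when the caller dropped the last reference. */
static bool
ex_refcnt_rele(struct ex_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow */
	return old == 1;
}
```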


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
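
The rounding rule above, as a small sketch. The function name and the
nanoseconds-per-tick parameter are assumptions for illustration; the
committed code is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical conversion: round the interval up to whole ticks, then add
 * one more tick because the current tick is already partly used up.
 */
static uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	assert(nsec_per_tick > 0);
	ticks = nsecs / nsec_per_tick;
	if (nsecs % nsec_per_tick != 0)
		ticks++;		/* round up */
	if (ticks < UINT64_MAX)
		ticks++;		/* the tick we are in the middle of */
	return ticks;
}
```

At 100 ticks per second (10,000,000 ns per tick), a request of exactly one
tick's worth of time becomes 2 ticks, so the sleep cannot end early.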


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
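
A userland sketch of a TAILQ-based sleep queue and a single-pass wakeup over
it, using <sys/queue.h>. The struct layout and the sketch_wakeup() name are
hypothetical and much simpler than the kernel's; the point is only that
unlinking an element from a TAILQ is constant time, so one walk of the
queue is enough.

```c
#include <stddef.h>
#include <sys/queue.h>

struct sleeper {
	const void *wchan;		/* address this thread sleeps on */
	int awake;
	TAILQ_ENTRY(sleeper) link;
};
TAILQ_HEAD(sleepq, sleeper);

/* Wake every sleeper waiting on "chan"; returns how many were woken. */
static int
sketch_wakeup(struct sleepq *q, const void *chan)
{
	struct sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking */
		if (s->wchan == chan) {
			TAILQ_REMOVE(q, s, link);
			s->awake = 1;
			n++;
		}
	}
	return n;
}
```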


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
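
A userland analogue of atomic flag updates, written with C11 <stdatomic.h>
instead of the kernel's atomic_{set,clear}bits_int. The function names are
hypothetical; the sketch only shows why a read-modify-write on a shared
flag word should be a single atomic operation.

```c
#include <stdatomic.h>

/* Set the given bits without losing concurrent updates to other bits. */
static void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear the given bits, again as one atomic read-modify-write. */
static void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain "flags |= bits" compiles to a separate load and store, so an
interrupt or another CPU changing the word in between would have its update
overwritten.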


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.198 16-Aug-2023 claudio

Move SCHED_LOCK after sleep_signal_check.

sleep_signal_check() is there to look for pending signals / single thread
requests which were posted before sleep_setup() finished. Once p_stat
is set to SSLEEP the wakeup and delivery of signals is taken care of
by ptsignal and single_thread_set().

Moving the SCHED_LOCK further down allows to cleanup cursig() and to
remove a SCHED_LOCK recursion in single_thread_check().

OK mpi@


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish() the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
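
The three-step transaction above can be illustrated with a toy, single-threaded
model. This is only a sketch under stated assumptions: every ex_* name is
hypothetical, the `wsleep' field merely stands in for the P_WSLEEP flag, and
none of this is the kernel code. It shows why a wakeup that arrives between
step 1 and step 3 must not put the thread back on the run queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all ex_* names are hypothetical, not kernel API. */
enum ex_state { EX_ONPROC, EX_SLEEPING };

struct ex_thread {
	enum ex_state state;
	const void *wchan;	/* channel registered on, NULL if none */
	bool wsleep;		/* stand-in for P_WSLEEP: between step 1 and 3 */
};

/* Step 1: register on the sleep queue, but keep running. */
static void
ex_sleep_setup(struct ex_thread *t, const void *chan)
{
	assert(t->state == EX_ONPROC);
	t->wchan = chan;
	t->wsleep = true;
}

/* A wakeup on `chan'; may arrive at any point after step 1. */
static void
ex_wakeup(struct ex_thread *t, const void *chan)
{
	if (t->wchan != chan)
		return;
	t->wchan = NULL;		/* take it off the sleep queue */
	if (t->wsleep)
		return;			/* still running: leave run queue alone */
	t->state = EX_ONPROC;		/* really asleep: make it run again */
}

/*
 * Step 3: sleep unless the caller's recheck (step 2) said not to, or a
 * wakeup already arrived.  Returns true if the thread went to sleep.
 */
static bool
ex_sleep_finish(struct ex_thread *t, bool do_sleep)
{
	t->wsleep = false;
	if (!do_sleep || t->wchan == NULL) {
		t->wchan = NULL;
		return false;
	}
	t->state = EX_SLEEPING;
	return true;
}
```

In this model a caller does ex_sleep_setup(), rechecks its condition, and
passes the result to ex_sleep_finish(); a wakeup landing in between simply
cancels the sleep.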


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK() instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
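
The ordering requirement can be sketched in portable C11 atomics. This is an
illustration only, not the committed kernel code: the ex_* names are
hypothetical, and the release decrement plus acquire fence is the usual
userland idiom for "everything before the 1->0 transition is visible to the
destructor".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference counter. */
struct ex_refcnt {
	atomic_uint refs;
};

static void
ex_refcnt_init(struct ex_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
ex_refcnt_take(struct ex_refcnt *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
static bool
ex_refcnt_rele(struct ex_refcnt *r)
{
	unsigned int old;

	/* Release: our earlier loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	assert(old != 0);
	if (old == 1) {
		/* Acquire: the destructor sees every other holder's writes. */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}
```

Only the thread that sees ex_refcnt_rele() return true may free the object.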


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
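
A minimal userland sketch of the two queries, assuming a C11 atomic counter;
the ex_* names are hypothetical and do not reproduce the kernel
implementation. Note that the value returned by the read helper is only a
snapshot and may be stale as soon as it is returned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference counter. */
struct ex_refcnt {
	atomic_uint refs;
};

static void
ex_refcnt_take(struct ex_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	/* Taking a reference on an already dead object is a bug. */
	assert(old != 0);
}

/* True if somebody besides the caller holds a reference. */
static bool
ex_refcnt_shared(struct ex_refcnt *r)
{
	return atomic_load(&r->refs) > 1;
}

/* Snapshot of the counter value. */
static unsigned int
ex_refcnt_read(struct ex_refcnt *r)
{
	return atomic_load(&r->refs);
}
```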


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); a contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build script comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
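
A minimal sketch of the clamp, assuming the two values stated above. The
EX_* constants and the helper name are hypothetical stand-ins, not the
kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define EX_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define EX_MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Map a caller-supplied duration to one that always arms a timeout:
 * the sentinel value is pulled down to the longest finite sleep.
 */
static uint64_t
ex_clamp_sleep_nsec(uint64_t nsecs)
{
	uint64_t out = (nsecs > EX_MAXTSLP) ? EX_MAXTSLP : nsecs;

	assert(out != EX_INFSLP);
	return out;
}
```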


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
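
The boundary case can be checked with a small userland sketch. The helper
names ex_ts_le() and ex_deadline_elapsed() are hypothetical and stand in for
the timespeccmp(3) macro so the example needs only standard headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: true if *a <= *b (both normalized). */
static bool
ex_ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute deadline has elapsed once the clock has reached it. */
static bool
ex_deadline_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	assert(deadline->tv_nsec >= 0 && deadline->tv_nsec < 1000000000L);
	return ex_ts_le(deadline, now);
}
```

With a strict "<" the case deadline == now would wrongly count as not yet
elapsed, which is the off-by-one-nanosecond behaviour described above.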


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
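
One way to picture the "at least" guarantee is the sketch below. It is not
the conversion algorithm that was committed; it is a hypothetical helper
that handles the two hazards named above: it rounds the division up, then
adds one tick because the current tick is already partly used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of ticks to wait so that at least `nsecs'
 * nanoseconds pass, given a tick length of `tick_nsec' nanoseconds.
 */
static uint64_t
ex_nsec_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Round up without risking overflow in nsecs + tick_nsec - 1. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* The current tick may end at any moment, so it cannot be counted. */
	if (ticks < UINT64_MAX)
		ticks++;
	return ticks;
}
```

With a 10 ms tick, a request for exactly 10 ms yields 2 ticks: one may expire
almost immediately, the second covers the requested interval.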


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
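
The point about starting with the right type can be illustrated with a
small userland sketch. This is not the code from kern_synch.c: the name
to_ticks_sat and its saturate-at-INT_MAX policy are assumptions made for
illustration. All arithmetic is done in uint64_t and the seconds are
bounded before the multiply, so no signed overflow can occur.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the kernel): convert a relative
 * timeout of sec seconds plus nsec nanoseconds into clock ticks at
 * hz ticks per second, saturating at INT_MAX.
 */
int
to_ticks_sat(uint64_t sec, uint64_t nsec, int hz)
{
	uint64_t uhz, ticks;

	assert(hz > 0 && hz <= 1000000);
	assert(nsec < 1000000000ULL);
	uhz = (uint64_t)hz;

	/* Bound the seconds first so the multiply below cannot wrap. */
	if (sec > (uint64_t)INT_MAX / uhz)
		return INT_MAX;

	/*
	 * sec * uhz is at most INT_MAX here, and nsec * uhz stays below
	 * 10^15, so every intermediate value fits in 64 bits.
	 */
	ticks = sec * uhz + (nsec * uhz) / 1000000000ULL;
	if (ticks > (uint64_t)INT_MAX)
		return INT_MAX;
	return (int)ticks;
}
```

With hz = 100, a timeout of 1 second maps to 100 ticks, and an absurdly
large seconds value saturates instead of wrapping to a negative number.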


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
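
As a rough userland sketch of the "atomic inc/dec" part described above,
the following uses C11 atomics. The toy_ names are hypothetical and are
not the kernel's refcnt API, and the sleeping logic of refcnt_finalise is
deliberately left out. The assertions stand in for the kind of
overflow and underflow checks a diagnostic build might perform.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Toy reference counter; not the kernel's struct refcnt. */
struct toy_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference held by its creator. */
void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
void
toy_refcnt_take(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* object was already released */
	assert(old != UINT_MAX);	/* counter would overflow */
}

/* Drop a reference; returns 1 if this was the last one, else 0. */
int
toy_refcnt_rele(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* underflow */
	return (old == 1);
}
```

A caller that sees toy_refcnt_rele() return 1 knows it dropped the final
reference and may free the object.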


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
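
A minimal sketch of the rounding rule described above, assuming a
nanosecond timeout and a fixed hz. The function name timeout_ticks and
the saturation at INT_MAX are illustrative assumptions, not the code in
the tree.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a timeout in nanoseconds to ticks,
 * rounding up and adding one tick for the tick already in progress.
 */
int
timeout_ticks(uint64_t nsec, int hz)
{
	uint64_t tick_ns, ticks;

	assert(hz > 0 && hz <= 1000000000);
	tick_ns = 1000000000ULL / (uint64_t)hz;

	/* Round up: a partial tick still needs a whole tick of waiting. */
	ticks = nsec / tick_ns + (nsec % tick_ns != 0 ? 1 : 0);

	/* Saturate before adding the extra tick. */
	if (ticks >= (uint64_t)INT_MAX)
		return INT_MAX;

	/* Account for the tick we are already in the middle of. */
	return (int)ticks + 1;
}
```

Without the extra tick, a sleeper that starts late in the current tick
could be woken almost a full tick early.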


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into wall-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake at most n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
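
The clipping rule from the "nice bug" paragraph can be sketched as
follows. The message gives only the nice weight of two and the limit
NICE_WEIGHT * PRIO_MAX - PPQ; the concrete values for PUSER, PPQ and
MAXPRI below, and the priority formula itself, are assumptions based on
traditional BSD headers and are named with a TOY_ prefix to make that
explicit.

```c
#include <assert.h>

#define TOY_PUSER	50U	/* assumed base user priority */
#define TOY_PPQ		4U	/* assumed priorities per run queue */
#define TOY_PRIO_MAX	20U	/* highest nice value */
#define TOY_NICE_WEIGHT	2U	/* "nice weight of two" from the message */
#define TOY_MAXPRI	127U	/* assumed lowest (numerically largest) priority */
#define TOY_ESTCPU_LIMIT (TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ)

/* Clip the CPU usage estimate to the limit named in the message. */
unsigned int
toy_estcpu_clip(unsigned int estcpu)
{
	return estcpu > TOY_ESTCPU_LIMIT ? TOY_ESTCPU_LIMIT : estcpu;
}

/* Assumed priority formula: base + unscaled estcpu + weighted nice. */
unsigned int
toy_user_priority(unsigned int estcpu, int nice)
{
	unsigned int pri;

	assert(nice >= 0 && (unsigned int)nice <= TOY_PRIO_MAX);
	pri = TOY_PUSER + toy_estcpu_clip(estcpu) +
	    TOY_NICE_WEIGHT * (unsigned int)nice;
	return pri > TOY_MAXPRI ? TOY_MAXPRI : pri;
}

/* Run queue index for a given priority. */
unsigned int
toy_run_queue(unsigned int pri)
{
	return pri / TOY_PPQ;
}
```

Under these assumed constants, a compute-bound process at nice 0 tops
out one run queue below an idle process at nice +20, which is the
separation the clipping is meant to preserve.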


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.197 14-Aug-2023 mpi

Extend scheduler tracepoints to follow CPU jumping.

- Add two new tracepoints sched:fork & sched:steal
- Include selected CPU number in sched:wakeup
- Add sched:unsleep corresponding to sched:sleep which matches add/removal
of threads on the sleep queue

ok claudio@


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish(); the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.
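
The three-step transaction can be pictured with a single-threaded userland
model. This is a hedged sketch only: the model_* names, the struct and the
MODEL_WSLEEP flag are hypothetical stand-ins for the kernel's proc state and
P_WSLEEP, and no real locking or context switching is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_stat { MODEL_ONPROC, MODEL_SLEEP };
#define MODEL_WSLEEP	0x1	/* between setup (step 1) and finish (step 3) */

struct model_proc {
	enum model_stat stat;
	int flags;
	const void *wchan;
};

/* Step 1: register on the wait channel but keep running. */
void
model_sleep_setup(struct model_proc *p, const void *chan)
{
	assert(p->stat == MODEL_ONPROC);
	p->wchan = chan;
	p->flags |= MODEL_WSLEEP;
}

/* A wakeup must not re-enqueue a proc that has not gone to sleep yet. */
void
model_wakeup(struct model_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	if (p->flags & MODEL_WSLEEP)
		return;			/* still running, nothing to enqueue */
	p->stat = MODEL_ONPROC;
}

/* Step 3: sleep only if step 2 said so and no wakeup arrived meanwhile. */
bool
model_sleep_finish(struct model_proc *p, bool do_sleep)
{
	p->flags &= ~MODEL_WSLEEP;
	if (!do_sleep || p->wchan == NULL) {
		p->wchan = NULL;
		return false;		/* keep on running */
	}
	p->stat = MODEL_SLEEP;
	return true;
}
```

A wakeup that lands between model_sleep_setup() and model_sleep_finish()
only clears the channel, so the finish step sees it and does not sleep.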

OK mpi@ cheloha@


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
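
The ordering requirement can be illustrated in portable C11. This is a
sketch under assumptions, not the kernel implementation: toy_obj and
toy_rele() are hypothetical names, and C11 release/acquire is used here in
place of the kernel's own barrier primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_obj {
	atomic_uint refs;
	int payload;
};

/* Drop one reference; returns true when the caller released the last one. */
bool
toy_rele(struct toy_obj *o)
{
	unsigned int old;

	/* Release: earlier loads and stores complete before the decrement. */
	old = atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release);
	assert(old != 0);		/* underflow would be a bug */
	if (old != 1)
		return false;
	/* Acquire: destructor work starts only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```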

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.
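
A minimal userland sketch of the two helpers, assuming a counter that starts
at one. The toy_* names are hypothetical and C11 atomics stand in for the
kernel's atomic operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_refcnt {
	atomic_uint refs;
};

void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

void
toy_refcnt_take(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* taking a ref on a dead object is a bug */
}

/* Snapshot only: the value may change right after it is read. */
unsigned int
toy_refcnt_read(struct toy_refcnt *r)
{
	return atomic_load(&r->refs);
}

/* False means the caller is the only reference holder. */
bool
toy_refcnt_shared(struct toy_refcnt *r)
{
	return toy_refcnt_read(r) > 1;
}
```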

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build script comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
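
The clamp described above amounts to the following. A sketch only: the
TOY_* constants mirror the values quoted in this message, and the helper
name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define TOY_MAXTSLP	(UINT64_MAX - 1)	/* longest real timeout */

/* Limit a requested duration so it can never be mistaken for INFSLP. */
uint64_t
toy_clamp_sleep_nsecs(uint64_t nsecs)
{
	uint64_t out = (nsecs > TOY_MAXTSLP) ? TOY_MAXTSLP : nsecs;

	assert(out != TOY_INFSLP);	/* a timeout will always be set */
	return out;
}
```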


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
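
For illustration, the corrected comparison behaves like this userland
helper. It is a sketch with hypothetical names; it does not use the
timespeccmp(3) macro itself, so that it builds on systems without it.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True when the absolute deadline is less than or equal to now. */
bool
toy_abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	assert(deadline->tv_nsec >= 0 && deadline->tv_nsec < 1000000000L);
	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);

	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	return deadline->tv_nsec <= now->tv_nsec;	/* "<=", not "<" */
}
```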


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows enforcing that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
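
The rule in the last sentence can be written as a tiny helper. This is a
sketch, not the code from this commit: the function name and its arguments
are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Pick the error for a sleep interrupted by signal signo:
 * ECANCELED if the handler was installed with SA_RESTART, else EINTR.
 */
int
toy_interrupted_sleep_error(int signo, bool sa_restart)
{
	assert(signo > 0);	/* only meaningful after a real signal */
	return sa_restart ? ECANCELED : EINTR;
}
```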

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
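
The point about starting with the right type can be sketched as follows.
This is an illustration with a hypothetical helper, not the code that was
changed; it assumes 0 < hz <= 1000000000.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a seconds/nanoseconds pair to ticks. Each operand is cast to
 * uint64_t before the arithmetic, so the multiply and add happen in an
 * unsigned type where wraparound is well defined, instead of in signed
 * long long where overflow is undefined behaviour.
 */
uint64_t
toy_timespec_to_ticks(long long sec, long nsec, int hz)
{
	uint64_t tick_nsec;

	assert(hz > 0 && hz <= 1000000000);
	tick_nsec = 1000000000ULL / (uint64_t)hz;
	return (uint64_t)sec * (uint64_t)hz + (uint64_t)nsec / tick_nsec;
}
```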


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
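
The rounding described above can be sketched in plain C. This is only an
illustration under stated assumptions, not the kernel's code: the helper
name timeout_to_ticks() and its nanosecond argument are hypothetical, and
hz is assumed to divide one second evenly enough for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from kern_synch.c): convert a relative
 * timeout in nanoseconds to clock ticks for a clock that fires hz
 * times per second.
 */
uint64_t
timeout_to_ticks(uint64_t nsecs, uint64_t hz)
{
	uint64_t tick_nsec, ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	tick_nsec = 1000000000ULL / hz;		/* length of one tick */

	/* Round up so that a partial tick counts as a whole one. */
	ticks = (nsecs + tick_nsec - 1) / tick_nsec;

	/* Add one for the tick that is already in progress. */
	return ticks + 1;
}
```

With hz = 100, a 10 ms timeout becomes 2 ticks under this sketch: one
full tick plus the one already under way, so the sleep cannot end early.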


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace are now used.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless P_NORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
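
The p_estcpu clipping described under "nice bug" can be illustrated with a
small sketch. The constants are assumptions chosen for the example: the
log only states the nice weight of two, while PRIO_MAX = 20 and PPQ = 4
are assumed here. The helpers clip_estcpu(), prio_offset() and
queue_index() are hypothetical names, not functions from the kernel.

```c
#include <assert.h>

/* Assumed values; only the nice weight of two is stated in the log. */
#define NICE_WEIGHT	2	/* priority steps per nice level */
#define PRIO_MAX	20	/* largest nice value (assumed) */
#define PPQ		4	/* priorities per run queue (assumed) */

#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the decayed CPU usage estimate to the limit named above. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Hypothetical priority offset: clipped usage plus weighted nice (0..20). */
unsigned int
prio_offset(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= PRIO_MAX);
	return clip_estcpu(estcpu) + NICE_WEIGHT * nice;
}

/* Index of the run queue that an offset falls into. */
unsigned int
queue_index(unsigned int offset)
{
	return offset / PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out at
offset 36 (queue 9), while an idle nice +20 process starts at offset 40
(queue 10), so the former can no longer penalize itself onto the latter's
queue.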


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.196 10-Aug-2023 claudio

Add some KASSERT on the proc p_stat in sleep_finish()
OK mpi@


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish(); the code can use
the P_SINTR flag in p_flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work, wakeups from signals, the single thread API and wakeup(9)
need to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
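
The three-step transaction above can be modelled with a single-threaded
toy in userland C. This is a sketch of the idea only: every name here
(struct toy_proc, TOY_WSLEEP, toy_sleep_setup(), toy_wakeup(),
toy_sleep_finish()) is hypothetical, there is no locking, and the "sleep
queue" is a single wait channel pointer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_WSLEEP	0x01	/* between setup and finish (models P_WSLEEP) */

struct toy_proc {
	const void *wchan;	/* channel being slept on, NULL if none */
	int flags;
	int on_runqueue;	/* set when a wakeup made the proc runnable */
};

/* Step 1: register on the (single entry) sleep queue. */
void
toy_sleep_setup(struct toy_proc *p, const void *chan)
{
	assert(chan != NULL);
	p->wchan = chan;
	p->flags |= TOY_WSLEEP;
}

/* A wakeup on chan; it may run between step 1 and step 3. */
void
toy_wakeup(struct toy_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	/* A proc still preparing to sleep is running; do not enqueue it. */
	if ((p->flags & TOY_WSLEEP) == 0)
		p->on_runqueue = 1;
}

/*
 * Step 3: returns 1 if the proc would block, 0 if it keeps running.
 * do_sleep carries the outcome of the step 2 recheck.
 */
int
toy_sleep_finish(struct toy_proc *p, int do_sleep)
{
	p->flags &= ~TOY_WSLEEP;
	if (!do_sleep || p->wchan == NULL) {
		p->wchan = NULL;
		return 0;
	}
	return 1;
}
```

In this model a wakeup that lands between toy_sleep_setup() and
toy_sleep_finish() only clears the wait channel, so the finish step sees
it and returns without blocking instead of losing the wakeup.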


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
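
The semantics of the two new functions can be sketched with C11 atomics in
userland. This is an assumption-laden illustration, not the kernel
implementation: the sketch_ prefixed names are hypothetical, the counter
is assumed to start at one, and the default sequentially consistent
atomics are used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns 1 when the last one was released. */
int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);
	return old == 1;
}

/* Nonzero if someone other than the caller also holds a reference. */
int
sketch_refcnt_shared(struct sketch_refcnt *r)
{
	return atomic_load(&r->refs) > 1;
}

/* Snapshot of the counter; it may be stale as soon as it is returned. */
unsigned int
sketch_refcnt_read(struct sketch_refcnt *r)
{
	return atomic_load(&r->refs);
}
```

A caller that sees sketch_refcnt_shared() return zero holds the only
reference, which is the case the log entry calls out; the value from
sketch_refcnt_read() is only a snapshot and is best kept to diagnostics.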


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind, an error is returned, causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
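
A minimal userland sketch of the clamping described above. The two constant
values are the ones stated in this message; the toy_ names and the helper
itself are hypothetical and are not the code that was committed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define TOY_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Limit a requested duration so that it can never be mistaken for the
 * "no timeout" sentinel.
 */
static uint64_t
toy_clamp_nsecs(uint64_t nsecs)
{
	uint64_t clamped = nsecs < TOY_MAXTSLP ? nsecs : TOY_MAXTSLP;

	assert(clamped != TOY_INFSLP);
	return clamped;
}
```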


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
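
The off-by-one can be reproduced in userland with a sketch like the
following. TOY_TSCMP is a local stand-in modelled on timespeccmp(3) that
takes structs by value; it and the other toy_ names are assumptions for
illustration only.

```c
#include <assert.h>
#include <time.h>

/* Compare two struct timespec values with the given operator. */
#define TOY_TSCMP(a, b, cmp)					\
	(((a).tv_sec == (b).tv_sec) ?				\
	    ((a).tv_nsec cmp (b).tv_nsec) :			\
	    ((a).tv_sec cmp (b).tv_sec))

/* Build a normalized timespec. */
static struct timespec
toy_ts(time_t sec, long nsec)
{
	struct timespec ts;

	assert(nsec >= 0 && nsec < 1000000000L);
	ts.tv_sec = sec;
	ts.tv_nsec = nsec;
	return ts;
}

/* Fixed check: the deadline has elapsed once the clock reaches it. */
static int
toy_elapsed_fixed(struct timespec deadline, struct timespec now)
{
	return TOY_TSCMP(deadline, now, <=);
}

/* Old check: the deadline only counts as elapsed once the clock is past it. */
static int
toy_elapsed_old(struct timespec deadline, struct timespec now)
{
	return TOY_TSCMP(deadline, now, <);
}
```

With a deadline of 1.000000000 and a clock reading of 1.000000000, only the
fixed variant reports the timeout as elapsed.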


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow the lock to be released when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc(), a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.
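
the "atomic inc/dec" part can be sketched in userland with C11 atomics.
this is an illustration only: the toy_ names are hypothetical, the
sleeping done by refcnt_finalise is not modelled, and the asserts merely
stand in for whatever checks the kernel code performs.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference held by its creator. */
static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
toy_refcnt_take(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* the object must still be alive */
}

/* Drop one reference; return 1 if it was the last one. */
static int
toy_refcnt_rele(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow */
	return old == 1;
}

/* Take two extra references, then count how many releases report "last". */
static int
toy_refcnt_demo(void)
{
	struct toy_refcnt r;
	int last = 0;

	toy_refcnt_init(&r);
	toy_refcnt_take(&r);
	toy_refcnt_take(&r);
	last += toy_refcnt_rele(&r);
	last += toy_refcnt_rele(&r);
	last += toy_refcnt_rele(&r);
	return last;
}
```

only the release that drops the count to zero reports "last", which is the
point at which a waiter in a finalise routine could be woken.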

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
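
The rounding can be sketched as follows. This is a userland illustration
under assumptions: the timeout is expressed in nanoseconds, tick_nsec is a
hypothetical parameter giving the length of one tick, and the saturation
step is an addition of this sketch rather than part of the commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a duration to a tick count that covers at least the requested
 * time: round any partial tick up, then add one more tick because the
 * current tick is already partly used.
 */
static uint64_t
toy_nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Divide first so that rounding up cannot overflow. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Account for the tick in progress, saturating at the maximum. */
	if (ticks < UINT64_MAX)
		ticks++;
	return ticks;
}
```

With a 10 ms tick, a request of exactly 10 ms yields 2 ticks and a request
of 10 ms plus one nanosecond yields 3.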

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)
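
A rough userland sketch of a TAILQ-based sleep queue and a single-pass
wakeup over it, using the queue(3) macros. The struct layout and every toy_
name are assumptions for illustration; this is not the kernel's wakeup().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct toy_proc {
	const void *wchan;		/* channel slept on, NULL when awake */
	int runnable;
	TAILQ_ENTRY(toy_proc) entry;	/* sleep queue linkage */
};
TAILQ_HEAD(toy_slpque, toy_proc);

/*
 * Walk the queue once and wake every sleeper on chan. The next pointer
 * is saved before an element is unlinked, so removal during the walk is
 * safe and each element is visited exactly once.
 */
static int
toy_wakeup(struct toy_slpque *q, const void *chan)
{
	struct toy_proc *p, *next;
	int n = 0;

	for (p = TAILQ_FIRST(q); p != NULL; p = next) {
		next = TAILQ_NEXT(p, entry);
		if (p->wchan != chan)
			continue;
		TAILQ_REMOVE(q, p, entry);
		p->wchan = NULL;
		p->runnable = 1;
		n++;
	}
	return n;
}

/* Put three sleepers on two channels and wake one channel. */
static int
toy_wakeup_demo(void)
{
	struct toy_slpque q;
	struct toy_proc procs[3];
	int chan_a = 0, chan_b = 0;
	int i, woken;

	TAILQ_INIT(&q);
	for (i = 0; i < 3; i++) {
		procs[i].wchan = (i == 1) ? (const void *)&chan_b :
		    (const void *)&chan_a;
		procs[i].runnable = 0;
		TAILQ_INSERT_TAIL(&q, &procs[i], entry);
	}

	woken = toy_wakeup(&q, &chan_a);

	/* Only the sleeper on chan_b is left on the queue. */
	assert(TAILQ_FIRST(&q) == &procs[1]);
	assert(TAILQ_NEXT(&procs[1], entry) == NULL);
	assert(procs[0].runnable && procs[2].runnable && !procs[1].runnable);
	return woken;
}
```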

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
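
The clipping rule from the "nice bug" paragraph can be sketched in a few lines. This is only an illustration of the arithmetic, not the committed code: the constant values (PUSER 50, MAXPRI 127, PPQ 4, PRIO_MAX 20, NICE_WEIGHT 2) and both function names are assumptions chosen to agree with the description above, and nice is taken as a value from 0 to PRIO_MAX.

```c
/* Illustrative constants; the values are assumptions, not taken from the commit. */
#define PUSER        50   /* base user priority */
#define MAXPRI       127  /* numerically largest (worst) priority */
#define PPQ          4    /* priorities per run queue */
#define PRIO_MAX     20   /* largest nice value */
#define NICE_WEIGHT  2    /* priority points per nice level */

/* Upper bound on p_estcpu, as stated in the commit message. */
#define ESTCPU_LIMIT (NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the decayed CPU estimate so it stays below the nice +20 penalty. */
unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}

/* User priority; larger numbers run later. nice is 0..PRIO_MAX here. */
unsigned int
user_priority(unsigned int estcpu, unsigned int nice)
{
	unsigned int pri = PUSER + estcpu_clip(estcpu) + NICE_WEIGHT * nice;

	return pri < MAXPRI ? pri : MAXPRI;
}
```

Under these assumed values a compute-bound process at nice 0 tops out at priority 86, while an idle process at nice +20 starts at 90, so the former can no longer penalize itself onto the queue of the latter.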


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.195 14-Jul-2023 claudio

struct sleep_state is no longer used, remove it.
Also remove the priority argument to sleep_finish(); the code can use
the p_flag P_SINTR flag to know if the signal check is needed or not.
OK cheloha@ kettenis@ mpi@


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work, wakeup from signals, the single thread API and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
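
The three-step transaction can be modelled with a small single-threaded toy. Everything below is hypothetical: the struct, its fields and the toy_* functions only mimic the roles of sleep_setup(), wakeup(9), sleep_finish() and the P_WSLEEP flag, and there is no real scheduler or locking involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a thread; field names only mimic the kernel's. */
struct toy_proc {
	const void *wchan;   /* channel being slept on, NULL when none */
	bool        wsleep;  /* between step 1 and step 3 (P_WSLEEP analogue) */
	bool        asleep;  /* thread has actually been switched out */
};

/* Step 1: register the thread on the sleep queue for chan. */
void
toy_sleep_setup(struct toy_proc *p, const void *chan)
{
	p->wchan = chan;
	p->wsleep = true;
	p->asleep = false;
}

/*
 * Wakeup: take the thread off the channel. If it is still between
 * step 1 and step 3 it is running, so it must not be made runnable
 * again; clearing the channel is enough.
 */
void
toy_wakeup(struct toy_proc *p, const void *chan)
{
	if (p->wchan != chan)
		return;
	p->wchan = NULL;
	if (!p->wsleep)
		p->asleep = false;   /* setrunnable() analogue */
}

/*
 * Step 3: sleep only if the step 2 recheck asked for it and no wakeup
 * arrived in the meantime. Returns true if the thread went to sleep.
 */
bool
toy_sleep_finish(struct toy_proc *p, bool do_sleep)
{
	p->wsleep = false;
	if (do_sleep && p->wchan != NULL) {
		p->asleep = true;
		return true;
	}
	p->wchan = NULL;
	return false;
}
```

In this model a wakeup that lands between toy_sleep_setup() and toy_sleep_finish() clears the channel, so the finish step keeps the thread running instead of putting it to sleep on an event that has already fired.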


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
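
The ordering requirement can be illustrated in userland with C11 atomics. This is a sketch under assumptions, not the kernel implementation: struct toy_refcnt and the toy_refcnt_* functions are hypothetical names, and the release/acquire pairing shown is the common C11 idiom for this pattern rather than the barriers the commit actually added.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland counter; not the kernel's struct refcnt. */
struct toy_refcnt {
	atomic_uint refs;
};

void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

void
toy_refcnt_take(struct toy_refcnt *r)
{
	/* Taking a reference needs no ordering of its own. */
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
bool
toy_refcnt_rele(struct toy_refcnt *r)
{
	/* Release: earlier loads and stores complete before the decrement. */
	unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
	    memory_order_release);

	if (old != 1)
		return false;

	/* Acquire: destructor accesses begin only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would run the object's destructor only when toy_refcnt_rele() returns true, at which point every other holder's writes to the object are visible.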


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
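
A userland sketch of the two helpers, assuming a plain C11 atomic counter. The type and function names are hypothetical and only mirror the behaviour described above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter type used only for this sketch. */
struct toy_counter {
	atomic_uint refs;
};

/* True when someone else also holds a reference. */
bool
toy_counter_shared(struct toy_counter *c)
{
	return atomic_load_explicit(&c->refs, memory_order_relaxed) > 1;
}

/* Snapshot of the counter; it may be stale as soon as it is returned. */
unsigned int
toy_counter_read(struct toy_counter *c)
{
	return atomic_load_explicit(&c->refs, memory_order_relaxed);
}
```

A typical use of the first helper is a copy-on-write decision: modify the object in place when it is not shared, copy it first otherwise.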


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover them.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned, causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
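
The clamp amounts to a one-line guard. A minimal sketch, assuming the INFSLP and MAXTSLP values quoted above; clamp_sleep_nsec() is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

#define INFSLP   UINT64_MAX        /* "no timeout" sentinel, per the commit */
#define MAXTSLP  (UINT64_MAX - 1)  /* longest sleep that still sets a timeout */

/* Hypothetical helper: keep a computed duration from aliasing INFSLP. */
uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return nsecs < MAXTSLP ? nsecs : MAXTSLP;
}
```

With this guard a very large but finite duration can never be mistaken for a request to sleep without a timeout.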


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
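
The difference between the two comparisons shows up exactly at the boundary. The sketch below defines a local TS_CMP macro as a stand-in for timespeccmp(3) so that it does not depend on a BSD header; the macro and both function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define TS_CMP(a, b, op)                        \
	(((a)->tv_sec == (b)->tv_sec) ?         \
	    ((a)->tv_nsec op (b)->tv_nsec) :    \
	    ((a)->tv_sec op (b)->tv_sec))

/* Corrected check: the deadline has elapsed once the clock reaches it. */
bool
abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <=);
}

/* Previous check: misses the instant where the clock equals the deadline. */
bool
abs_timeout_elapsed_old(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <);
}
```

With a deadline of 1.000000000 and a clock reading of 1.000000000, only the corrected check reports the timeout as elapsed.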


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
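
One way to get an "at least this long" guarantee from a tick-based timer is to round the interval up to whole ticks and then add one more tick for the partially elapsed current tick. The sketch below shows that idea only; it is not the conversion algorithm credited to mpi@ above, which the log does not reproduce. The function name is hypothetical and tick_nsec (the length of one tick in nanoseconds) is assumed to be nonzero.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest tick count that guarantees a sleep of
 * at least nsecs nanoseconds, given ticks of tick_nsec nanoseconds.
 */
int
nsec_to_min_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	/* Round the interval up to whole ticks. */
	uint64_t ticks = nsecs / tick_nsec;

	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Saturate so the result fits in an int tick count. */
	if (ticks >= INT_MAX)
		return INT_MAX;

	/* The current tick is already partly over, so wait one more. */
	return (int)(ticks + 1);
}
```

With a 10 ms tick, a request for exactly 10 ms yields 2 ticks: one for the interval itself and one to cover whatever part of the current tick has already passed.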


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
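
The point about starting with the right type can be illustrated with a
small sketch. The function name sketch_secs_to_ticks and its parameters
are hypothetical; this is not the code touched by the commit, only an
illustration of casting the operand before the multiplication.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale a second count to ticks.
 * Casting the operands to uint64_t before multiplying makes the
 * arithmetic itself unsigned, so an out-of-range product wraps in a
 * defined way instead of being signed-overflow undefined behaviour.
 */
static uint64_t
sketch_secs_to_ticks(long long secs, int hz)
{
	assert(secs >= 0 && hz > 0);
	return (uint64_t)secs * (uint64_t)hz;
}
```

Assigning the result of a signed multiplication to an unsigned variable
would not help, because the overflow happens before the assignment.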


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
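
The "atomic inc/dec" half of such a wrapper can be sketched in userland
with C11 atomics. All names below (sketch_refcnt and its functions) are
hypothetical, and the sleeping part described above (waiting in
refcnt_finalise until all references are gone) is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* taking a reference on a dead object is a bug */
}

/* Returns 1 if this call dropped the last reference, 0 otherwise. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* releasing more than was taken is a bug */
	return old == 1;
}
```

The caller that sees a return value of 1 from the release function is
the one responsible for tearing the object down.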


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
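
The rounding rule described in this entry can be sketched as follows.
The function name sketch_nsec_to_ticks and the nanosecond-based
interface are assumptions made for illustration; the code changed by the
commit is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a timeout in nanoseconds to clock ticks.
 * tick_nsec is the length of one tick (e.g. 10000000 for hz = 100).
 * Round up so a partial tick still counts, then add one because the
 * current tick is already partly elapsed. Overflow for inputs close to
 * UINT64_MAX is not handled in this sketch.
 */
static uint64_t
sketch_nsec_to_ticks(uint64_t nsec, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	ticks = nsec / tick_nsec;
	if (nsec % tick_nsec != 0)
		ticks++;		/* round up */
	return ticks + 1;		/* the tick we are in the middle of */
}
```

Without the extra tick, a sleep requested late in the current tick could
return earlier than the requested duration.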


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (roughly nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
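
The clipping described under "nice bug" can be sketched as follows. This is
a minimal illustration, not the committed code: only the nice weight of two
comes from the message, while PRIO_MAX = 20, PPQ = 4 and the helper names
estcpu_clip() and estcpu_clip_old() are assumptions made for the example.

```c
#include <limits.h>

/* Assumed values; only the nice weight of two is stated in the message. */
#define NICE_WEIGHT	2	/* priority points per nice level */
#define PRIO_MAX	20	/* largest nice value (assumed) */
#define PPQ		4	/* priorities per run queue (assumed) */

#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: keep p_estcpu at or below the limit. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return (estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu);
}

/*
 * The old behaviour: an unsigned char can never exceed UCHAR_MAX,
 * so this comparison is always false and the clip does nothing.
 */
static unsigned char
estcpu_clip_old(unsigned char estcpu)
{
	return (estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu);
}
```

With these assumed constants the limit is 36, so a compute-bound process
stays at least one queue away from where a nice +20 process lands.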


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.194 11-Jul-2023 claudio

Rework sleep_setup()/sleep_finish() to no longer hold the scheduler lock
between calls.

Instead of forcing an atomic operation across multiple calls use a three
step transaction.
1. setup sleep state by calling sleep_setup()
2. recheck sleep condition to ensure that the event did not fire before
sleep_setup() registered the proc onto the sleep queue
3. call sleep_finish() to either sleep or keep on running based on the
step 2 outcome and any possible signal delivery

To make this work wakeup from signals, single thread api and wakeup(9) need
to be aware if a process is between step 1 and step 3 so that the process
is not enqueued back onto the runqueue while going to sleep. Introduce
the p_flag P_WSLEEP to detect this situation.

On top of this remove the spl dance in msleep() which is no longer required.
It is ok to process interrupts between step 1 and 3.

OK mpi@ cheloha@
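
The three-step transaction can be modelled in userland roughly as below.
This is a toy, single-threaded sketch for reasoning about the ordering; the
struct toy_proc type and the toy_*() functions are hypothetical and are not
the kernel interfaces. The wsleep field plays the role the message gives to
P_WSLEEP.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a thread; field names are illustrative only. */
struct toy_proc {
	const void *wchan;	/* channel slept on, NULL when not queued */
	bool wsleep;		/* between setup and finish (like P_WSLEEP) */
	bool slept;		/* set if the finish step really went to sleep */
};

/* Step 1: register on the sleep queue and open the transaction. */
static void
toy_sleep_setup(struct toy_proc *p, const void *chan)
{
	p->wchan = chan;
	p->wsleep = true;
	p->slept = false;
}

/*
 * A wakeup on the channel dequeues the sleeper.  While wsleep is set
 * the thread is still running, so it must not be put on a run queue.
 */
static void
toy_wakeup(struct toy_proc *p, const void *chan)
{
	if (p->wchan == chan)
		p->wchan = NULL;
}

/* Step 3: sleep only if step 2 asked for it and no wakeup arrived. */
static void
toy_sleep_finish(struct toy_proc *p, bool do_sleep)
{
	if (do_sleep && p->wchan != NULL)
		p->slept = true;	/* a real kernel would switch away here */
	p->wchan = NULL;
	p->wsleep = false;
}
```

A wakeup that lands between toy_sleep_setup() and toy_sleep_finish() clears
wchan, so the finish step returns without sleeping and the event is not lost.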


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
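
The ordering requirement can be illustrated with C11 atomics. This is a
userland analogue under stated assumptions, not the kernel code: the
message does not say which primitives the kernel uses, and struct
toy_refcnt and the toy_refcnt_*() names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_refcnt {
	atomic_uint refs;
};

static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
toy_refcnt_take(struct toy_refcnt *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
static bool
toy_refcnt_rele(struct toy_refcnt *r)
{
	/* Release: earlier loads and stores are ordered before the decrement. */
	if (atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release) != 1)
		return (false);
	/* Acquire: destructor work begins only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return (true);
}
```

Only the caller that observes the 1->0 transition pays for the acquire
fence; every other release is a single release-ordered decrement.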


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
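
A minimal sketch of the clamp, assuming the two constants have exactly the
values quoted above. The helper name clamp_sleep_nsec() is hypothetical and
does not appear in the commit.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "do not set a timeout" */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Hypothetical helper: a finite request must never turn into INFSLP. */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return (nsecs > MAXTSLP ? MAXTSLP : nsecs);
}
```

A caller that computed a duration of UINT64_MAX nanoseconds would pass the
clamped value on, so the sleep still gets a timeout.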


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
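
The corrected comparison can be written out without the macro. This is a
sketch only; abs_timeout_elapsed() is a hypothetical helper that spells out
what a "<=" comparison of two timespecs means.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: true once the clock has reached the timeout. */
static bool
abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	if (tsp->tv_sec != now->tv_sec)
		return (tsp->tv_sec < now->tv_sec);
	/* Same second: "<=" so the timeout fires at T, not at T + 1ns. */
	return (tsp->tv_nsec <= now->tv_nsec);
}
```

With a timeout of 1.000000000 the helper is true when the clock reads
exactly 1.000000000, which is the case the strict "<" comparison missed.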


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
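
The point about starting with the right type can be sketched as below. The
function secs_to_ticks() and its parameters are hypothetical, chosen only
to show where the cast has to go.

```c
#include <stdint.h>

/*
 * Hypothetical helper.  Casting the operands before multiplying makes
 * the arithmetic unsigned 64-bit, where overflow wraps and is well
 * defined.  Multiplying as signed first and converting the result
 * afterwards would already have overflowed in a signed type.
 */
static uint64_t
secs_to_ticks(int64_t secs, int hz)
{
	return ((uint64_t)secs * (uint64_t)hz);
}
```

The sketch assumes the caller has already rejected negative values; the
cast only fixes the arithmetic, not the input validation.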


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
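
As a rough illustration of the rounding described above (not the kernel's
actual code; nsec_to_ticks_sketch and its parameters are hypothetical
names), a conversion that rounds up and then adds one tick for the
partially elapsed current tick could look like this:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a timeout in nanoseconds to clock ticks
 * at "hz" ticks per second.  Overflow for very large nsecs is ignored
 * in this sketch.
 */
uint64_t
nsec_to_ticks_sketch(uint64_t nsecs, uint64_t hz)
{
	uint64_t tick_nsec;

	assert(hz > 0 && hz <= 1000000000ULL);
	tick_nsec = 1000000000ULL / hz;		/* length of one tick */

	/*
	 * Round up to a whole number of ticks, then add one more because
	 * the tick we are currently in is already partly used up.
	 */
	return (nsecs + tick_nsec - 1) / tick_nsec + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could
end almost a full tick early.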

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.
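
A minimal sketch of the idea, using C11 <stdatomic.h> as a stand-in for
the kernel's atomic_{set,clear}bits_int (setbits_sketch and
clearbits_sketch are hypothetical names, not the kernel API):

```c
#include <stdatomic.h>

/* Atomically set the bits in "v"; concurrent updaters cannot lose bits. */
void
setbits_sketch(atomic_uint *p, unsigned int v)
{
	atomic_fetch_or(p, v);
}

/* Atomically clear the bits in "v". */
void
clearbits_sketch(atomic_uint *p, unsigned int v)
{
	atomic_fetch_and(p, ~v);
}
```

A plain "p_flag |= X" is a read-modify-write that an interrupt can
interleave with, which is the hazard the atomic variants avoid.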

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wake up.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes
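
The behaviour can be sketched as follows. This is a simplified userland
model over an array, not the kernel's sleep-queue code; struct
sleeper_sketch and wakeup_n_sketch are hypothetical names:

```c
#include <stddef.h>

/* Toy model of a sleeping process: the channel it waits on and its state. */
struct sleeper_sketch {
	const void *wchan;	/* NULL once woken */
	int runnable;
};

/*
 * Wake at most "n" sleepers waiting on "chan" and return how many were
 * woken.  wakeup_one corresponds to n == 1; a plain wakeup corresponds
 * to an effectively unlimited n.
 */
int
wakeup_n_sketch(struct sleeper_sketch *q, size_t len, const void *chan, int n)
{
	int woken = 0;
	size_t i;

	for (i = 0; i < len && woken < n; i++) {
		if (q[i].wchan == chan && !q[i].runnable) {
			q[i].wchan = NULL;
			q[i].runnable = 1;
			woken++;
		}
	}
	return woken;
}
```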


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.193 28-Jun-2023 claudio

First step at removing struct sleep_state.

Pass the timeout and sleep priority not only to sleep_setup() but also
to sleep_finish(). With that sls_timeout and sls_catch can be removed
from struct sleep_state.

The timeout is now setup first thing in sleep_finish() and no longer as
last thing in sleep_setup(). This should not cause a noticeable difference
since the code run between sleep_setup() and sleep_finish() is minimal.

OK kettenis@


# 1.192 01-Jun-2023 claudio

Change wakeup_proc() to no longer grab the SCHED_LOCK(); instead it must
be called with SCHED_LOCK() held. Also add an extra argument to update
the process flags p_flag so that the timeout handler can set the
P_TIMEOUT flag before making the process runnable.
OK mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.
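
A minimal userland sketch of these two helpers, assuming C11
<stdatomic.h> in place of the kernel's atomic primitives (all names
ending in _sketch are hypothetical, not the refcnt API itself):

```c
#include <assert.h>
#include <stdatomic.h>

struct refcnt_sketch {
	atomic_uint refs;
};

/* A new object starts with one reference, held by its creator. */
void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns nonzero when the last one was released. */
int
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);
	return old == 1;
}

/* Nonzero if someone other than the caller also holds a reference. */
int
refcnt_sketch_shared(struct refcnt_sketch *r)
{
	return atomic_load(&r->refs) > 1;
}

/* Snapshot of the counter; it may be stale by the time it is used. */
unsigned int
refcnt_sketch_read(struct refcnt_sketch *r)
{
	return atomic_load(&r->refs);
}
```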

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great deal.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
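
The clamping described above can be sketched in userland as follows. This
is only an illustration: SKETCH_INFSLP, SKETCH_MAXTSLP and
clamp_sleep_nsecs() are hypothetical stand-ins that mirror the values named
in the commit message, not the code in kern_synch.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring the values given in the commit message. */
#define SKETCH_INFSLP	UINT64_MAX		/* means "do not set a timeout" */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Clamp a requested duration so it can never be mistaken for INFSLP. */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	uint64_t clamped = nsecs > SKETCH_MAXTSLP ? SKETCH_MAXTSLP : nsecs;

	assert(clamped != SKETCH_INFSLP);
	return clamped;
}
```

A caller that computed UINT64_MAX nanoseconds would pass UINT64_MAX - 1
after clamping, so a timeout is still armed.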


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
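
A minimal userland sketch of the boundary case described above. The helpers
ts_le() and timeout_elapsed() are hypothetical and written out by hand
instead of using timespeccmp(3); they only illustrate why the comparison
has to be <= and not <.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: true if a <= b, comparing seconds, then nanoseconds. */
static bool
ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute timeout has elapsed once the clock has reached it. */
static bool
timeout_elapsed(const struct timespec *timeout, const struct timespec *now)
{
	assert(timeout->tv_nsec >= 0 && timeout->tv_nsec < 1000000000L);
	return ts_le(timeout, now);
}
```

With a timeout of 1.000000000, the check reports "elapsed" when the clock
reads exactly 1.000000000, which a strict < comparison would miss.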


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
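
One way to picture the "at least N nanoseconds" conversion is the sketch
below. It is an assumption-laden illustration, not the algorithm committed
here: nsecs_to_ticks() and its tick_nsec parameter (nanoseconds per clock
tick) are hypothetical names. It rounds the division up and adds one tick
for the tick already in progress.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical conversion: return a tick count that covers at least
 * nsecs nanoseconds, given tick_nsec nanoseconds per tick.
 */
static int
nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	/* Round up without overflowing: divide first, then add the remainder. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);
	/* One extra tick for the partially elapsed current tick, clamped. */
	if (ticks >= INT_MAX)
		return INT_MAX;
	return (int)ticks + 1;
}
```

With a 10 ms tick, a request for 10000001 ns yields 3 ticks: two to cover
the interval after rounding up, plus one for the current tick.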


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
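
The point about starting with the right type can be illustrated with the
sketch below. It is a hedged userland example, not the kernel code:
timespec_to_ticks() and its parameters are hypothetical. Casting the first
operand to uint64_t forces the whole expression to be evaluated in unsigned
64-bit arithmetic, where wraparound is defined, instead of overflowing a
signed type and then assigning the result.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical conversion of a (sec, nsec) interval to clock ticks at
 * hz ticks per second, computed entirely in uint64_t.
 */
static uint64_t
timespec_to_ticks(long long sec, long nsec, int hz)
{
	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L && hz > 0);
	/* The casts come first, so no signed multiplication ever happens. */
	return (uint64_t)sec * (uint64_t)hz +
	    (uint64_t)nsec * (uint64_t)hz / 1000000000ULL;
}
```

At hz = 100, an interval of 1.5 seconds converts to 150 ticks.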


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
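
The TAILQ-based sleep queue mentioned in the last item can be pictured with
the userland sketch below. struct sleeper, struct sleepq and sketch_wakeup()
are hypothetical stand-ins, not the kernel's structures; the sketch only
shows how a tail queue lets a wakeup unlink every matching sleeper in a
single O(n) pass.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a sleeping thread. */
struct sleeper {
	const void *wchan;		/* channel the thread sleeps on */
	int runnable;			/* set once the thread is woken */
	TAILQ_ENTRY(sleeper) entry;
};

TAILQ_HEAD(sleepq, sleeper);

/* Wake every sleeper on chan in one pass; returns the number woken. */
static int
sketch_wakeup(struct sleepq *q, const void *chan)
{
	struct sleeper *s, *next;
	int woken = 0;

	assert(q != NULL);
	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		/* Fetch the successor first: s may be unlinked below. */
		next = TAILQ_NEXT(s, entry);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, entry);	/* O(1) unlink */
		s->wchan = NULL;
		s->runnable = 1;
		woken++;
	}
	return woken;
}
```

Sleepers on other channels stay queued, and a second wakeup on the same
channel finds nothing left to wake.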


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@
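
The check described in 1.51 can be sketched in userland C as follows. This is an illustration only, not the committed kernel code: the helper name account_runtime and the use of plain microsecond counters in place of struct timeval are assumptions made to keep the example short.

```c
#include <stdint.h>

/*
 * Hypothetical helper: add the slice [start_usec, now_usec] to
 * *rtime_usec.  If the clock appears to have gone backwards, leave the
 * accumulated runtime untouched (instead of zeroing it) and report it.
 * Returns 0 on success, -1 when time was not monotonic.
 */
int
account_runtime(int64_t *rtime_usec, int64_t start_usec, int64_t now_usec)
{
	if (now_usec < start_usec)
		return -1;	/* non-monotonic: do not change the runtime */
	*rtime_usec += now_usec - start_usec;
	return 0;
}
```

A caller could log a diagnostic message when the function returns -1, which roughly corresponds to the DIAGNOSTIC printf mentioned above.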


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
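
The clipping rule from the "nice bug" paragraph above can be sketched as follows. This is a minimal illustration rather than the NetBSD or OpenBSD code: only the nice weight of two comes from the text, while the values of PRIO_MAX and PPQ and the helper name estcpu_clip are assumptions.

```c
/* Only NICE_WEIGHT == 2 is stated in the log; the rest is assumed. */
#define NICE_WEIGHT	2
#define PRIO_MAX	20	/* assumed maximum nice value */
#define PPQ		4	/* assumed priorities per run queue */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Hypothetical helper: keep the CPU usage estimate at or below
 * NICE_WEIGHT * PRIO_MAX - PPQ so that a compute-bound process cannot
 * penalize itself onto the queue used by nice +20 processes.
 */
unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit is 36, well below UCHAR_MAX, which is why clipping to UCHAR_MAX had no effect.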


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.191 15-Feb-2023 mvs

Relax kernel lock assertion within tsleep(9). The `nowake' channel is
the special case which doesn't expect wakeup(9), so allow to use it
without kernel lock held.

Discussed with and ok by claudio@


Revision tags: OPENBSD_7_2_BASE
# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
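
The ordering requirement can be illustrated with C11 atomics. This is a userland sketch of the general release/acquire pattern, not the kernel implementation; the refcnt_sketch type and function names are hypothetical.

```c
#include <stdatomic.h>

struct refcnt_sketch {
	atomic_uint	refs;
};

void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);	/* creator holds the first reference */
}

void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns 1 when the caller dropped the last reference, else 0. */
int
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old;

	/* Release: preceding loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	if (old == 1) {
		/* Acquire: destructor accesses begin only after 1->0. */
		atomic_thread_fence(memory_order_acquire);
		return 1;
	}
	return 0;
}
```

The thread that sees the return value 1 may then destroy the object and will observe the latest state written by the other reference holders.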


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
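
The semantics of the two helpers can be sketched in userland C11 as follows. The rc_sketch names are hypothetical and the counter layout is an assumption; only the described behaviour (non-zero when there are multiple references, and a snapshot read) is taken from the text above.

```c
#include <stdatomic.h>

struct rc_sketch {
	atomic_uint	refs;
};

/* Non-zero when more than one reference exists at the time of the load. */
int
rc_sketch_shared(struct rc_sketch *r)
{
	return atomic_load(&r->refs) > 1;
}

/* Snapshot of the counter; it may be stale as soon as it is returned. */
unsigned int
rc_sketch_read(struct rc_sketch *r)
{
	return atomic_load(&r->refs);
}
```

Because both functions only sample the counter, a caller that relies on a zero result from the "shared" check must itself hold a reference and must ensure nobody else can take a new one.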


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
	@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
	latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
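
The clamping described above can be sketched as follows. The two constants use the values given in the text; the helper name clamp_timeout_nsecs is hypothetical, and the real syscall code may structure this differently.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* means "do not set a timeout" */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper: limit a computed duration to MAXTSLP so that it
 * can never be mistaken for INFSLP by a *sleep_nsec()-style callee.
 */
uint64_t
clamp_timeout_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

A caller that really wants to sleep without a timeout would pass INFSLP explicitly instead of going through this helper.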


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
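
The boundary case can be checked with a small userland sketch. TS_CMP below is a hypothetical stand-in with the same shape as timespeccmp(3), defined locally so the example does not depend on a particular system header; abs_timeout_elapsed is likewise an illustrative name.

```c
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/*
 * An absolute timeout has elapsed once the clock has reached it,
 * i.e. when the timeout is less than or equal to the current time.
 */
int
abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	return TS_CMP(tsp, now, <=);
}
```

With a timeout of 1.000000000, the function reports "elapsed" as soon as the clock reads exactly 1.000000000, whereas a strict < comparison would wait one more nanosecond.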


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
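
One way to guarantee a minimum sleep when converting nanoseconds to ticks is sketched below. It is not necessarily the algorithm that was committed: the function name, the tick_nsec parameter and the exact rounding are assumptions chosen to show the two concerns named above, integer division and the partially elapsed current tick.

```c
#include <stdint.h>

/*
 * Hypothetical conversion: tick_nsec is the length of one clock tick in
 * nanoseconds and must be non-zero.  The result is rounded up to whole
 * ticks and one extra tick is added, because the current tick may be
 * almost over when the sleep starts.
 */
uint64_t
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	/* Keep the rounding addition below from wrapping around. */
	if (nsecs > UINT64_MAX - tick_nsec)
		nsecs = UINT64_MAX - tick_nsec;
	return (nsecs + tick_nsec - 1) / tick_nsec + 1;
}
```

With a 10 ms tick, a request for 1 ns becomes 2 ticks and a request for 10000001 ns becomes 3 ticks, so the thread never wakes before the requested interval has passed.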


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows us to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
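
The error selection described above reduces to a small decision, sketched here in userland C. The helper name and its boolean parameter are hypothetical; only the mapping from SA_RESTART to ECANCELED or EINTR comes from the text.

```c
#include <errno.h>

/*
 * Hypothetical helper: pick the error for a sleep interrupted by a
 * signal.  ECANCELED tells userland that the handler was installed
 * with SA_RESTART and the call may be retried; EINTR says otherwise.
 */
int
interrupted_sleep_errno(int sa_restart)
{
	return sa_restart ? ECANCELED : EINTR;
}
```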


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
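
The overflow-safe conversion described above can be sketched roughly as
follows. This is a minimal userland illustration, not the code from this
commit: the function name timeout_to_ticks, its parameters, and the
early-clamp strategy are assumptions made for the example. The point it
shows is that the arithmetic starts out in uint64_t and the result is
clamped before it is narrowed to int.

```c
#include <limits.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL

/*
 * Hypothetical helper: convert a relative timeout to scheduler ticks.
 * Returns the tick count clamped to INT_MAX, or -1 on invalid input.
 */
int
timeout_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t nsec_per_tick, to_ticks;

	if (hz <= 0 || hz > 1000000 || sec < 0 || nsec < 0 ||
	    (uint64_t)nsec >= NSEC_PER_SEC)
		return -1;

	/* Clamp early so the 64-bit multiply below cannot wrap. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* All operands are uint64_t before any arithmetic happens. */
	nsec_per_tick = NSEC_PER_SEC / (uint64_t)hz;
	to_ticks = (uint64_t)sec * (uint64_t)hz +
	    (uint64_t)nsec / nsec_per_tick;

	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

Casting only the result of a signed multiplication would be too late,
since the signed overflow has already happened by then; casting the
operands first keeps the whole computation in a type with defined
wraparound.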


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world are
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after asking guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
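
A rough sketch of the rounding rule described above, under stated
assumptions: the helper name nsec_to_ticks_roundup and its
nanosecond-based interface are invented for illustration and are not
the kernel's actual code. Rounding up ensures a partial tick still
counts as a whole one, and the extra tick covers the tick that is
already partly elapsed when the sleep starts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of ticks to sleep for a timeout of
 * 'nsec' nanoseconds when one tick lasts 'nsec_per_tick' nanoseconds.
 */
uint64_t
nsec_to_ticks_roundup(uint64_t nsec, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	if (nsec_per_tick == 0)
		return 0;	/* invalid tick length in this sketch */

	/* Round up: any remainder costs one more full tick. */
	ticks = nsec / nsec_per_tick;
	if (nsec % nsec_per_tick != 0)
		ticks++;

	/* Add one for the tick we are already in the middle of. */
	if (ticks != UINT64_MAX)
		ticks++;
	return ticks;
}
```

Without the extra tick, a sleep requested just before a tick boundary
could return almost a full tick early.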


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now that telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
More people have tested this for a longer time than the
c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.190 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
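
A minimal userland sketch of the ordering described above, written with
C11 atomics instead of the kernel's own primitives. The names obj_ref,
obj_ref_take and obj_ref_rele are hypothetical; this is not the refcnt(9)
code, only the common release/acquire pattern for a 1->0 transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter. */
struct obj_ref {
	atomic_uint refs;
};

static void
obj_ref_init(struct obj_ref *r)
{
	atomic_init(&r->refs, 1);
}

static void
obj_ref_take(struct obj_ref *r)
{
	/* Taking a reference needs no ordering of its own. */
	unsigned int old = atomic_fetch_add_explicit(&r->refs, 1,
	    memory_order_relaxed);

	assert(old != 0);	/* must already hold a reference */
	(void)old;
}

/* Returns true when the caller dropped the last reference. */
static bool
obj_ref_rele(struct obj_ref *r)
{
	/* Release: earlier loads and stores happen before the decrement. */
	unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
	    memory_order_release);

	assert(old != 0);	/* underflow check */
	if (old != 1)
		return false;

	/* Acquire: the destructor's accesses begin only after 1->0. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would run the object destructor only when obj_ref_rele()
returns true.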


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
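
The two helpers can be pictured with a small userland sketch, assuming
C11 atomics. struct counter, counter_read and counter_shared are
hypothetical names, not the kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter. */
struct counter {
	atomic_uint refs;
};

/* Snapshot of the counter; it may be stale as soon as it is returned. */
static unsigned int
counter_read(struct counter *c)
{
	return atomic_load_explicit(&c->refs, memory_order_relaxed);
}

/* False (zero) means the caller is the only reference holder. */
static bool
counter_shared(struct counter *c)
{
	unsigned int refs = counter_read(c);

	assert(refs > 0);	/* caller must hold a reference */
	return refs > 1;
}
```

Both results are only a snapshot, so a caller that acts on them needs
some other guarantee that the count cannot change underneath it.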


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
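
A small userland sketch of that limit. The two constants are defined
locally with the values given above; clamp_sleep_nsec is a hypothetical
helper, not the code in sys___thrsleep().

```c
#include <assert.h>
#include <stdint.h>

/* Values as stated in the log entry; defined here so the sketch stands alone. */
#define INFSLP	UINT64_MAX		/* means "do not set a timeout" */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Hypothetical helper: make sure a finite request never aliases INFSLP. */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	uint64_t clamped = (nsecs > MAXTSLP) ? MAXTSLP : nsecs;

	assert(clamped != INFSLP);
	return clamped;
}
```

Passing the clamped value to a tsleep_nsec(9)-style interface then
always requests a timeout, however large the caller's duration was.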


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
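
The boundary case can be checked with a short userland sketch. TS_CMP is
a local macro with the same shape as timespeccmp(3), and deadline_elapsed
is a hypothetical helper written only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Local stand-in with the same shape as the timespeccmp(3) macro. */
#define TS_CMP(a, b, cmp)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec cmp (b)->tv_nsec) :			\
	    ((a)->tv_sec cmp (b)->tv_sec))

/* Hypothetical helper: has the absolute deadline been reached at `now`? */
static bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	assert(deadline->tv_nsec >= 0 && deadline->tv_nsec < 1000000000L);
	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);

	/* "<=": the deadline elapses the moment the clock reaches it. */
	return TS_CMP(deadline, now, <=);
}
```

With a deadline of 1.000000000 the helper reports the timeout as elapsed
when the clock reads exactly 1.000000000, which is the case the strict
"<" comparison missed.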


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
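
One way to picture the "at least" guarantee is the sketch below. It is
an illustration under assumptions, not the conversion algorithm that was
committed: nsec_to_ticks_atleast is a hypothetical name and tick_nsec
(the length of one tick in nanoseconds) is passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a nanosecond interval to a tick count
 * that can never expire early.
 */
static uint64_t
nsec_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	assert(tick_nsec > 0);

	uint64_t ticks = nsecs / tick_nsec;

	/* Integer division truncates, so round a partial tick up. */
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* The current tick is already partly used up, so add one more. */
	if (ticks < UINT64_MAX)
		ticks++;

	return ticks;
}
```

With a 10 ms tick (tick_nsec = 10000000), a request for exactly 10 ms
becomes 2 ticks, and a request for 1 ns also becomes 2 ticks, so the
caller never wakes before the requested interval has passed.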


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
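
A userland sketch of the point about starting with the right type.
secs_to_ticks is a hypothetical helper and the saturation step is an
addition made for this example; the log entry only describes switching
the casts to uint64_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert whole seconds to ticks without signed
 * overflow. The operand is converted to uint64_t *before* multiplying,
 * so the arithmetic itself is done in an unsigned type.
 */
static uint64_t
secs_to_ticks(int64_t secs, unsigned int hz)
{
	assert(secs >= 0);

	uint64_t s = (uint64_t)secs;

	/* Saturate instead of wrapping (a choice made for this sketch). */
	if (hz != 0 && s > UINT64_MAX / hz)
		return UINT64_MAX;

	return s * hz;
}
```

Casting only the result, as in (uint64_t)(secs * hz), would still do the
multiplication in a signed type, where overflow is undefined behaviour.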


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
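
As a rough illustration of the clipping bound described above, the sketch
below caps a p_estcpu-style value at NICE_WEIGHT * PRIO_MAX - PPQ. The
constant values (a nice weight of 2 from the text, PRIO_MAX of 20, PPQ of 4)
and the helper name ex_estcpu_clip() are assumptions made for this example
and are not taken from the committed code.

```c
/* Assumed values, for illustration only. */
#define EX_NICE_WEIGHT	2	/* "nice weight of two" per the message */
#define EX_PRIO_MAX	20	/* largest nice value */
#define EX_PPQ		4	/* priorities per run queue (assumption) */
#define EX_ESTCPU_LIMIT	(EX_NICE_WEIGHT * EX_PRIO_MAX - EX_PPQ)

/*
 * Hypothetical helper: clip the scheduler history value so that a
 * compute-bound process cannot penalize itself onto the queue used by
 * nice +20 processes.
 */
static unsigned int
ex_estcpu_clip(unsigned int estcpu)
{
	return (estcpu > EX_ESTCPU_LIMIT ? EX_ESTCPU_LIMIT : estcpu);
}
```

With these assumed constants the limit works out to 36, unlike a clip to
UCHAR_MAX, which the message notes is a no-op for an unsigned char.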

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.189 28-Jun-2022 bluhm

Use btrace(8) to debug reference counting. dt(4) provides a static
tracepoint for each type of refcnt we have. As a start, add inpcb
and tdb refcnt. When the counter changes, btrace may print the
actual object, the current counter, the change value and optionally
the stack trace.
discussed with visa@; OK mpi@


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
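
The ordering requirement above can be sketched in userland with C11
atomics. This is only an analogy under stated assumptions: struct ex_refcnt
and the ex_refcnt_*() helpers are hypothetical stand-ins, and the kernel's
own primitives are not shown in this log.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference counter. */
struct ex_refcnt {
	atomic_uint refs;
};

static void
ex_refcnt_init(struct ex_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
ex_refcnt_take(struct ex_refcnt *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
static bool
ex_refcnt_rele(struct ex_refcnt *r)
{
	unsigned int old;

	/* Release: preceding loads and stores happen before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	if (old != 1)
		return false;
	/* Acquire: destructor work begins only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

Only the thread that observes the 1->0 transition pays for the acquire
fence, so the common path stays a single atomic decrement.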

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.
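
A minimal userland sketch of the two queries described above, assuming a
counter that starts at 1 for the first holder. The names struct ex_rc,
ex_rc_read() and ex_rc_shared() are hypothetical and only mirror the
behaviour stated in this message.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter. */
struct ex_rc {
	atomic_uint refs;
};

/* Snapshot of the counter; it may be stale as soon as it is returned. */
static unsigned int
ex_rc_read(struct ex_rc *r)
{
	return atomic_load(&r->refs);
}

/* False (zero) means the caller is the only reference holder. */
static bool
ex_rc_shared(struct ex_rc *r)
{
	return ex_rc_read(r) > 1;
}
```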

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this is used to discover them.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
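
The clamp described above can be sketched as follows. EX_INFSLP and
EX_MAXTSLP reuse the values given in this message; the helper name
ex_clamp_sleep_nsecs() is hypothetical and is not the code that was
committed.

```c
#include <stdint.h>

#define EX_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define EX_MAXTSLP	(UINT64_MAX - 1)	/* longest sleep with a timeout */

/*
 * Hypothetical helper: make sure a computed duration can never be
 * mistaken for the "sleep forever" sentinel.
 */
static uint64_t
ex_clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > EX_MAXTSLP ? EX_MAXTSLP : nsecs);
}
```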


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
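
A portable sketch of the corrected check, written without the timespeccmp
macro so that it builds outside the BSDs. The helper names ex_ts_le() and
ex_deadline_elapsed() are hypothetical.

```c
#include <stdbool.h>
#include <time.h>

/* True when *a is earlier than or equal to *b. */
static bool
ex_ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute deadline has elapsed once the clock has reached it. */
static bool
ex_deadline_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return ex_ts_le(deadline, now);
}
```

With a strict less-than comparison, the case where the clock equals the
deadline would be treated as not yet elapsed, which is the off-by-one
described above.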


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
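
The "sleep at least this long" conversion described above can be sketched
in userland C. This is only an illustration, not the committed algorithm:
the function name and the tick_nsec parameter are hypothetical, and the
clamp to INT_MAX is an assumption about what an int tick count can hold.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's code): convert a nanosecond
 * interval into a tick count that is guaranteed to cover it.
 * tick_nsec is the assumed length of one tick in nanoseconds.
 */
static int
sketch_nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t to_ticks;

	assert(tick_nsec > 0);

	/* Round up so that a partial tick counts as a whole one. */
	to_ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		to_ticks++;

	/* Clamp before adding one so the result still fits in an int. */
	if (to_ticks >= INT_MAX)
		return INT_MAX;

	/* Add one tick for the tick that is already in progress. */
	return (int)to_ticks + 1;
}
```

With a 10 ms tick, a request of 1 ns yields 2 ticks: one for the rounded-up
interval and one for the partially elapsed current tick.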


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows enforcing that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
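
The two conventions described above, a nanosecond timeout and UINT64_MAX
meaning "no timeout", can be sketched as follows. The names are
hypothetical stand-ins and not the helpers added to sys/time.h; saturating
on overflow is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Per the commit message, INFSLP is UINT64_MAX and means "no timeout". */
#define SKETCH_INFSLP	UINT64_MAX

/* Hypothetical helper: seconds to nanoseconds, saturating on overflow. */
static uint64_t
sketch_sec_to_nsec(uint64_t sec)
{
	if (sec > UINT64_MAX / 1000000000ULL)
		return UINT64_MAX;
	return sec * 1000000000ULL;
}

/* Hypothetical helper: does this nanosecond value request a timeout? */
static int
sketch_has_timeout(uint64_t nsecs)
{
	return nsecs != SKETCH_INFSLP;
}
```

In this sketch a conversion that saturates ends up equal to the "no
timeout" value, so an absurdly long sleep simply becomes an indefinite one.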


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
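
The "atomic inc/dec" part of such a wrapper can be sketched with C11
atomics. The sketch_* names are hypothetical and this is not the kernel's
refcnt implementation; the sleeping done by refcnt_finalise is deliberately
left out, since it depends on the kernel sleep machinery.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	/* The creator holds the first reference. */
	atomic_init(&r->refs, 1);
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Returns 1 if this call dropped the last reference, 0 otherwise. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	/* Releasing with no references held would be an underflow. */
	assert(old != 0);
	return old == 1;
}
```

A caller would free the object only when sketch_refcnt_rele() returns 1.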


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
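
The race is the gap between checking and deleting. A minimal userland
sketch of the safer pattern, where the delete itself reports what it did,
might look like this. The sketch_* names are hypothetical and the pending
flag is a stand-in for the real timeout state.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_timeout {
	atomic_int pending;
};

static void
sketch_timeout_add(struct sketch_timeout *to)
{
	atomic_store(&to->pending, 1);
}

/*
 * Clear the pending flag and report whether it was set, in a single
 * atomic step, so no separate "is it pending?" check is needed.
 */
static int
sketch_timeout_del(struct sketch_timeout *to)
{
	return atomic_exchange(&to->pending, 0);
}
```

A caller then writes "if (sketch_timeout_del(&to)) ..." instead of testing
the flag first and deleting afterwards.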


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
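
A single-pass wakeup over a TAILQ-based sleep queue can be sketched in
userland C as below. The struct and function names are hypothetical and
this is not the kernel's wakeup(); it only shows that each queue entry is
visited once and matching entries are unlinked in place.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sketch_sleeper {
	const void *wchan;		/* address being slept on */
	int awake;			/* set once woken */
	TAILQ_ENTRY(sketch_sleeper) entry;
};
TAILQ_HEAD(sketch_sleepq, sketch_sleeper);

/* Wake every sleeper waiting on chan; returns how many were woken. */
static int
sketch_wakeup(struct sketch_sleepq *q, const void *chan)
{
	struct sketch_sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		/* Save the successor before possibly unlinking s. */
		next = TAILQ_NEXT(s, entry);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, entry);
		s->wchan = NULL;
		s->awake = 1;
		n++;
	}
	return n;
}
```

Because TAILQ_REMOVE() unlinks an element in constant time, the whole scan
is linear in the number of queued sleepers.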


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes
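
A rough userland model of the "wake at most n sleepers on a channel" idea, assuming a flat array in place of the kernel sleep queues. `struct sk_sleeper` and `sk_wakeup_n()` are hypothetical names; the real function operates on the scheduler's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a sleeping thread: the channel it waits on. */
struct sk_sleeper {
	const void *wchan;	/* NULL once the sleeper has been woken */
};

/*
 * Wake at most n sleepers waiting on chan, scanning in array order.
 * Returns how many sleepers were actually woken.
 */
static int
sk_wakeup_n(struct sk_sleeper *s, size_t count, const void *chan, int n)
{
	int woken = 0;
	size_t i;

	assert(chan != NULL);

	for (i = 0; i < count && woken < n; i++) {
		if (s[i].wchan == chan) {
			s[i].wchan = NULL;
			woken++;
		}
	}
	return woken;
}
```

In this model a wakeup_one style helper would simply be `sk_wakeup_n(s, count, chan, 1)`.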


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
  steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
  platform hz values don't cause nice +0 processes to look like they are
  niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
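
The p_estcpu clipping argument from the "nice bug" paragraph can be sketched numerically. This is a sketch under stated assumptions: the nice weight of 2 and the limit NICE_WEIGHT * PRIO_MAX - PPQ come from the text, while PPQ = 4, PUSER = 50 and all `SK_`/`sk_` names are assumptions made only for illustration.

```c
#include <assert.h>

#define SK_NICE_WEIGHT	2	/* "nice weight of two" from the text */
#define SK_PRIO_MAX	20	/* largest nice value (nice +20) */
#define SK_PPQ		4	/* priorities per run queue (assumed) */
#define SK_PUSER	50	/* base user priority (assumed) */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the CPU usage estimate to the limit named in the text. */
static unsigned int
sk_clip_estcpu(unsigned int estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}

/* Priority from unscaled estcpu plus the weighted nice value. */
static unsigned int
sk_user_priority(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= SK_PRIO_MAX);
	return SK_PUSER + sk_clip_estcpu(estcpu) + SK_NICE_WEIGHT * nice;
}

/* Run queue index for a priority. */
static unsigned int
sk_run_queue(unsigned int pri)
{
	return pri / SK_PPQ;
}
```

Under these assumed constants a fully CPU-bound nice +0 process tops out one queue below an idle nice +20 process. Clipping at NICE_WEIGHT * PRIO_MAX without subtracting PPQ would put both on the same queue.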


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.188 12-Jun-2022 visa

Allow sleeping while clearing a sleep timeout

Since sys/kern/kern_timeout.c r1.84, timeout_barrier() has used sleeping
with soft-interrupt-driven timeouts. Adjust the sleep machinery so that
the timeout clearing can block in sleep_finish().

This adds one step of recursion inside sleep_finish(). However, the
sleep queue handling does not recurse because sleep_finish() completes
it before calling timeout_del_barrier().

This fixes the following panic:

panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: file "sys/kern/kern_synch.c", line 373
Stopped at db_enter+0x10: popq %rbp
db_enter() at db_enter+0x10
panic() at panic+0xbf
__assert() at __assert+0x25
sleep_setup() at sleep_setup+0x1d8
cond_wait() at cond_wait+0x46
timeout_barrier() at timeout_barrier+0x109
timeout_del_barrier() at timeout_del_barrier+0xa2
sleep_finish() at sleep_finish+0x16d
tsleep() at tsleep+0xb2
sys_nanosleep() at sys_nanosleep+0x12d
syscall() at syscall+0x374

OK mpi@ dlg@


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
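
The ordering requirement above can be illustrated with C11 atomics in userland. This is a sketch only, assuming release ordering on the decrement and an acquire fence after the 1->0 transition; the `sk_` names are hypothetical and the kernel uses its own atomic and membar primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland reference counter. */
struct sk_refcnt {
	atomic_uint refs;
};

static void
sk_refcnt_init(struct sk_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds one reference */
}

static void
sk_refcnt_take(struct sk_refcnt *r)
{
	unsigned int old;

	old = atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
	assert(old != 0);		/* must already hold a reference */
}

/* Returns true if the caller dropped the last reference. */
static bool
sk_refcnt_rele(struct sk_refcnt *r)
{
	unsigned int old;

	/* Release: earlier loads and stores complete before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	assert(old != 0);		/* underflow check */
	if (old == 1) {
		/* Acquire: the destructor sees every other holder's writes. */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}

/* True when more than one reference exists (a snapshot only). */
static bool
sk_refcnt_shared(struct sk_refcnt *r)
{
	return atomic_load_explicit(&r->refs, memory_order_relaxed) > 1;
}
```

A caller would free the object only when `sk_refcnt_rele()` returns true.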


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

    tracepoint:sched:enqueue
    {
        @ts[arg0] = nsecs;
    }

    tracepoint:sched:on__cpu
    /@ts[tid]/
    {
        latency = nsecs - @ts[tid];
    }

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

    sleep_setup();
    /* check condition or release lock */
    sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

    #include <sys/systm.h>

    tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

    int chan;

    tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
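
A minimal userland sketch of the bounding step, assuming a saturating timespec-to-nanoseconds conversion. The values of INFSLP and MAXTSLP are taken from the text; the `SK_`/`sk_` names are hypothetical, and the input is assumed to be a valid, non-negative timespec.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define SK_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define SK_MAXTSLP	(UINT64_MAX - 1)	/* longest real timeout */

/* Convert to nanoseconds, saturating at UINT64_MAX on overflow. */
static uint64_t
sk_timespec_to_nsec(const struct timespec *ts)
{
	uint64_t sec = (uint64_t)ts->tv_sec;
	uint64_t nsec = (uint64_t)ts->tv_nsec;

	assert(ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

	if (sec > (UINT64_MAX - nsec) / 1000000000ULL)
		return UINT64_MAX;
	return sec * 1000000000ULL + nsec;
}

/*
 * Never hand the sentinel to the sleep routine when the caller
 * asked for a timeout: a saturated value becomes SK_MAXTSLP.
 */
static uint64_t
sk_bounded_sleep_nsecs(const struct timespec *ts)
{
	uint64_t nsecs = sk_timespec_to_nsec(ts);

	if (nsecs == SK_INFSLP)
		nsecs = SK_MAXTSLP;
	return nsecs;
}
```

The point of the bound is that a very large but finite request must still arm a timeout instead of being mistaken for "sleep forever".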


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

    timespeccmp(tsp, &now, <=)

and not

    timespeccmp(tsp, &now, <)

as it is currently.
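
The boundary case can be checked with a small standalone comparison. This sketch does not use timespeccmp(3) itself; `sk_ts_le()` and `sk_abs_timeout_elapsed()` are hypothetical helpers that assume normalized timespec values.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True when a <= b, assuming both values are normalized. */
static bool
sk_ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute deadline has elapsed once the clock has reached it. */
static bool
sk_abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return sk_ts_le(deadline, now);
}
```

With a deadline of 1.000000000, the helper reports the timeout as elapsed when the clock reads exactly 1.000000000, which a strict less-than comparison would miss.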


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
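
The two concerns named above (the partially elapsed current tick and the truncation of integer division) can be sketched as follows. This is not the conversion algorithm from the commit, which is not shown in this log. It is a simpler illustration, and `sk_nsecs_to_min_ticks()` and its parameters are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a nanosecond interval to a tick count that is guaranteed
 * to cover at least nsecs, given tick_nsec nanoseconds per tick.
 */
static int
sk_nsecs_to_min_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Round up so a partial tick is not lost to integer division. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Clamp before adding so the result always fits in an int. */
	if (ticks >= INT_MAX)
		return INT_MAX;

	/*
	 * Add one tick: an unknown part of the current tick has already
	 * elapsed, so it cannot be counted toward the interval.
	 */
	return (int)(ticks + 1);
}
```

For example, with 10 ms ticks a request for 10 ms plus one nanosecond needs three ticks in this sketch: two to cover the interval after rounding up, and one to wait out the current tick.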


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
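
A sketch of the "start with the right type" point, assuming a seconds-plus-nanoseconds timeout converted to ticks at `hz` ticks per second. `sk_timeout_to_ticks()` is a hypothetical helper, not the code that was changed.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a timeout to ticks without signed overflow. The operands
 * are widened to uint64_t before the multiplication, so the
 * arithmetic itself is unsigned, and the result is clamped to INT_MAX.
 */
static int
sk_timeout_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t ticks;

	assert(sec >= 0);
	assert(nsec >= 0 && nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000000);

	/* Anything this large clamps anyway; also bounds the product. */
	if (sec > INT_MAX)
		return INT_MAX;

	ticks = (uint64_t)sec * (uint64_t)hz +
	    (uint64_t)nsec / (1000000000ULL / (uint64_t)hz);
	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```

Assigning an overflowed signed product to an unsigned variable would be too late, because the undefined behaviour has already happened in the multiplication. Widening the operands first avoids it.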


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalize that is better written once than many times.
refcnt_finalize sleeps until all references are released and does
so with sleep_setup and sleep_finish, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
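
The rounding rule above can be sketched as follows. This is a minimal illustration under assumptions, not the committed code: the name timeout_to_ticks() and the use of nanosecond arguments are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to a
 * tick count, given the length of one tick in nanoseconds.
 */
uint64_t
timeout_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	/* Round up so a partial tick is never truncated away. */
	uint64_t ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Add one for the tick that is already partly elapsed. */
	return ticks + 1;
}
```

With a 10 ms tick, a 10 ms timeout becomes 2 ticks, so the sleep cannot end early even if the current tick is almost over.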


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
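
The idea of replacing plain read-modify-write flag updates with atomic ones can be sketched in portable C11. This is only an illustration using stdatomic.h, not the kernel's atomic.h interface; toy_setbits() and toy_clearbits() are hypothetical names.

```c
#include <stdatomic.h>

/*
 * Set bits in a shared flag word as one atomic read-modify-write, so
 * a concurrent update from another context cannot be lost.
 */
void
toy_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits in a shared flag word as one atomic read-modify-write. */
void
toy_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain "flags |= bits" compiles to a separate load, OR and store; an interrupt that modifies the word between the load and the store would have its change overwritten.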


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes
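
The "wake at most n sleepers on a channel" idea can be sketched with a toy list. This is an illustration only, not the kernel's sleep queue: struct toy_sleeper and toy_wakeup_n() are hypothetical, and the toy ignores locking and run queues.

```c
#include <stddef.h>

/* Toy model of a sleeping thread on a singly linked sleep list. */
struct toy_sleeper {
	const void *wchan;		/* channel the thread sleeps on */
	int awake;			/* nonzero once woken */
	struct toy_sleeper *next;
};

/*
 * Wake at most n sleepers waiting on chan, in list order.
 * Returns the number of sleepers actually woken.
 */
int
toy_wakeup_n(struct toy_sleeper *head, const void *chan, int n)
{
	int woken = 0;

	for (struct toy_sleeper *s = head; s != NULL && woken < n;
	    s = s->next) {
		if (!s->awake && s->wchan == chan) {
			s->awake = 1;
			s->wchan = NULL;	/* no longer waiting */
			woken++;
		}
	}
	return woken;
}
```

In this model, waking a single sleeper is just toy_wakeup_n(head, chan, 1), which stops walking the list as soon as one match is found.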


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
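
The clipping rule from the "nice bug" section, keeping p_estcpu <= NICE_WEIGHT * PRIO_MAX - PPQ, can be sketched as below. This is an illustration under assumptions: toy_estcpu_clip() is a hypothetical name and the constants are passed as parameters because the message only states the nice weight (two); the values 20 for PRIO_MAX and 4 for PPQ used in the note are assumed, not taken from the message.

```c
/*
 * Illustrative only: clip a scheduler history value to the limit
 * named in the message, NICE_WEIGHT * PRIO_MAX - PPQ.
 */
unsigned int
toy_estcpu_clip(unsigned int estcpu, unsigned int nice_weight,
    unsigned int prio_max, unsigned int ppq)
{
	unsigned int limit = nice_weight * prio_max - ppq;

	return estcpu < limit ? estcpu : limit;
}
```

With the assumed values (2, 20, 4) the limit is 36, so a compute-bound process with a large history value is held just below the range reserved for nice +20 processes.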


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.187 13-May-2022 claudio

Use the process ps_mtx to protect the process sigacts structure.
With this cursig(), postsig() and trapsignal() become safe to be called
without KERNEL_LOCK. As a side-effect sleep with PCATCH no longer needs
the KERNEL_LOCK either. Since sending a signal can happen from interrupt
context raise the ps_mtx IPL to high.
Feedback from mpi@ and kettenis@
OK kettenis@


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.

OK bluhm@
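
The ordering requirement described above can be sketched with C11 atomics. This is a generic illustration of the release/acquire pattern for a reference count, not the kernel's implementation; the toy_refcnt names are hypothetical and stdatomic.h stands in for the kernel's own primitives.

```c
#include <stdatomic.h>

struct toy_refcnt {
	atomic_uint refs;
};

/* Start with one reference held by the creator. */
void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Taking a reference needs no ordering of its own in this sketch. */
void
toy_refcnt_take(struct toy_refcnt *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns 1 if the caller dropped the last reference, 0 otherwise. */
int
toy_refcnt_rele(struct toy_refcnt *r)
{
	/* Release: the caller's earlier loads and stores happen first. */
	unsigned int old = atomic_fetch_sub_explicit(&r->refs, 1,
	    memory_order_release);

	if (old == 1) {
		/*
		 * Acquire: accesses made by the destructor begin only
		 * after the 1->0 transition and see the latest state.
		 */
		atomic_thread_fence(memory_order_acquire);
		return 1;
	}
	return 0;
}
```

The caller that gets 1 back is the only one allowed to destroy the object; the fence pair ensures it observes every write made by the other holders before they released.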


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
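
The clamp described above might be sketched as follows. This is a userland
model, not the kernel code: INFSLP and MAXTSLP are redefined locally with
the values given in the message, and thrsleep_clamp() is a hypothetical
helper name.

```c
#include <stdint.h>

/* Values as stated in the commit message, redefined for a userland sketch. */
#define INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets a timeout */

/*
 * Hypothetical helper: limit a requested duration so that it can never
 * be mistaken for the INFSLP sentinel.
 */
static uint64_t
thrsleep_clamp(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

A caller would pass thrsleep_clamp(nsecs) instead of nsecs, so even a very
distant deadline still arms a timeout.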


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
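
The boundary case can be checked in userland with a small sketch. TS_CMP is
a local stand-in for timespeccmp(3), and abs_timeout_elapsed() is a
hypothetical helper name; neither is taken from the kernel source.

```c
#include <time.h>

/* Local stand-in for the BSD timespeccmp() macro, which is not in ISO C. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* Hypothetical helper: has the absolute timeout *tsp been reached at *now? */
static int
abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	/* "<=" treats the instant tsp == now as already elapsed. */
	return TS_CMP(tsp, now, <=);
}
```

With "<" in place of "<=" the helper would report the timeout as still
pending at the exact instant the clock equals T.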


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@
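
One plausible shape for such a conversion, as a userland sketch. The name
ts_to_nsec() is hypothetical and this is not the kernel's
TIMESPEC_TO_NSEC(); saturating at UINT64_MAX rather than wrapping is an
assumption of the sketch.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical userland helper, not the kernel macro: convert a
 * non-negative timespec to nanoseconds, saturating at UINT64_MAX
 * instead of wrapping around.
 */
static uint64_t
ts_to_nsec(const struct timespec *ts)
{
	const uint64_t nsec_per_sec = 1000000000ULL;
	uint64_t sec = (uint64_t)ts->tv_sec;
	uint64_t nsec = (uint64_t)ts->tv_nsec;

	/* sec * nsec_per_sec + nsec would exceed UINT64_MAX. */
	if (sec > (UINT64_MAX - nsec) / nsec_per_sec)
		return UINT64_MAX;
	return sec * nsec_per_sec + nsec;
}
```

Saturation keeps an absurdly large userland timeout from turning into a
short sleep after the multiplication wraps.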


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@
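
"Error preferencing" here means choosing one return value when both the
timeout stage and the signal stage report something. A minimal sketch,
assuming that a signal error wins over a timeout error; prefer_error() is a
hypothetical name and the stages are reduced to plain integers.

```c
#include <errno.h>

/*
 * Hypothetical helper modelling error preferencing: combine the result
 * of the timeout stage with the result of the signal stage.
 * Assumption of this sketch: a signal error (e.g. EINTR) takes
 * precedence over a timeout error (e.g. EWOULDBLOCK).
 */
static int
prefer_error(int timeout_error, int signal_error)
{
	if (signal_error != 0)
		return signal_error;
	return timeout_error;
}
```

Keeping the choice in one place means all callers agree on the error they
return when a sleep is both interrupted and timed out.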


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
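
The "atomic inc/dec" part can be modelled in userland with C11 atomics. The
my_refcnt names are hypothetical and this is not the kernel's refcnt code;
the sleeping done by refcnt_finalise is deliberately left out.

```c
#include <stdatomic.h>

/* Userland model of a reference count; all names are hypothetical. */
struct my_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference, held by its creator. */
static void
my_refcnt_init(struct my_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
my_refcnt_take(struct my_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop one reference; returns 1 if it was the last one, else 0. */
static int
my_refcnt_rele(struct my_refcnt *r)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

A finalise step would drop the creator's reference and then wait until the
count reaches zero, which is the subtle part the wrapper exists to hide.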


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
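
The pattern can be sketched with a mock timeout. mock_timeout,
mock_timeout_del() and cancel_and_unref() are hypothetical names, the mock
is single-threaded, and it only mirrors the "returns what it did" behaviour
described above.

```c
/* Mock of a kernel timeout, for illustration only. */
struct mock_timeout {
	int pending;
};

/* Cancel the timeout and report whether anything was actually cancelled. */
static int
mock_timeout_del(struct mock_timeout *to)
{
	int was_pending = to->pending;

	to->pending = 0;
	return was_pending;
}

/*
 * Drop the reference held on behalf of the timeout only if this call
 * cancelled it. There is no separate "is it pending?" check whose
 * answer could go stale before the delete.
 */
static int
cancel_and_unref(struct mock_timeout *to, int refs)
{
	if (mock_timeout_del(to))
		refs--;
	return refs;
}
```

The racy form asks whether the timeout is pending and then deletes it in a
second step; the timeout can fire between the two.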


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
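
The rounding rule can be written out as a small userland sketch.
nsec_to_ticks() is a hypothetical helper working on nanoseconds; it is not
the code from this revision.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration in nanoseconds to a count of
 * clock ticks of tick_nsec nanoseconds each. Round up so the sleep is
 * never short, then add one for the partially elapsed current tick.
 * tick_nsec is assumed to be at least 2, so the additions cannot
 * overflow.
 */
static uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	if (nsecs % tick_nsec != 0)
		ticks++;		/* round up */
	return ticks + 1;		/* the tick we are already in */
}
```

With a 10 ms tick, a request for exactly 10 ms becomes two ticks: the
current tick may be almost over, so one tick alone could expire too early.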


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up at most n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.186 30-Apr-2022 visa

Enforce proper memory ordering in refcnt_rele() and refcnt_finalize()

Make refcnt_rele() and refcnt_finalize() order memory operations so that
preceding loads and stores happen before 1->0 transition. Also ensure
that loads and stores that depend on the transition really begin only
after the transition has occurred. Otherwise the object destructor might
not see the object's latest state.
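
The ordering described above can be sketched in userland with C11 atomics.
This is only an illustration, not the kernel code: the names refcnt_sketch,
refcnt_sketch_take and refcnt_sketch_rele are hypothetical, and the kernel
uses its own atomic and membar primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's reference counter. */
struct refcnt_sketch {
	atomic_uint refs;
};

static void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	/* A new object starts with one reference, held by its creator. */
	atomic_init(&r->refs, 1);
}

static void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	/* Taking a reference needs no ordering of its own. */
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true when the caller dropped the last reference. */
static bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old;

	/* Release: earlier loads and stores are ordered before the decrement. */
	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_release);
	if (old != 1)
		return false;

	/* Acquire: destructor accesses begin only after the 1->0 transition. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```

A caller would run the object destructor only when refcnt_sketch_rele()
returns true.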

OK bluhm@


Revision tags: OPENBSD_7_1_BASE
# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.
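
A rough userland sketch of the two helpers, assuming a counter that is a
plain C11 atomic. The struct and function names below are hypothetical and
only mirror the behaviour described in this message.

```c
#include <stdatomic.h>

/* Hypothetical counter type, not the kernel's struct refcnt. */
struct refs_sketch {
	atomic_uint refs;
};

/* Snapshot of the counter; the value may be stale by the time it is used. */
static unsigned int
refs_sketch_read(struct refs_sketch *r)
{
	return atomic_load(&r->refs);
}

/* Non-zero when more than one reference exists, zero when the caller is alone. */
static int
refs_sketch_shared(struct refs_sketch *r)
{
	return atomic_load(&r->refs) > 1;
}
```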

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
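
The clamp can be pictured as follows. This is a minimal sketch: the
SKETCH_ constants copy the values quoted in this message and the helper
name is hypothetical, not the code in sys___thrsleep().

```c
#include <stdint.h>

/* Values as quoted in the commit message; the SKETCH_ names are local. */
#define SKETCH_INFSLP	UINT64_MAX		/* means "set no timeout" */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Limit a caller-supplied duration so that it can never be mistaken
 * for the "no timeout" sentinel.
 */
static uint64_t
thrsleep_nsecs_sketch(uint64_t nsecs)
{
	return nsecs > SKETCH_MAXTSLP ? SKETCH_MAXTSLP : nsecs;
}
```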


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
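
The difference between the two comparisons can be checked with a small
standalone sketch. TS_CMP is a local macro written for this example so it
builds without the <sys/time.h> timespeccmp macro, and abs_timeout_elapsed
is a hypothetical helper.

```c
#include <stdbool.h>
#include <time.h>

/* Local comparison macro, modelled on the timespeccmp idiom. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* An absolute deadline has elapsed once the clock has reached it. */
static bool
abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <=);
}
```

With a deadline of 1.000000000, the helper reports the timeout as elapsed
when the clock reads exactly 1.000000000, which the strict "<" form would
miss.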


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
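
One way to picture the conversion is the sketch below. It is an assumption
made for illustration, not the algorithm that was committed: round the
interval up to whole ticks, then add one tick because the current tick has
already partly elapsed. The function name and the tick_nsec parameter are
hypothetical.

```c
#include <stdint.h>

/*
 * Convert a duration in nanoseconds to a tick count that is guaranteed
 * to cover at least that duration.  tick_nsec is the length of one
 * tick in nanoseconds and must be non-zero.
 */
static uint64_t
nsec_to_ticks_sketch(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	/* Round up without computing nsecs + tick_nsec - 1, which could wrap. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Saturate instead of wrapping to zero. */
	if (ticks == UINT64_MAX)
		return ticks;

	/* One extra tick covers the partially elapsed current tick. */
	return ticks + 1;
}
```

With a 10 ms tick, a request for exactly 10 ms yields two ticks, so the
thread cannot wake before the full interval has passed.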

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
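
The point about starting with the right type can be illustrated with a
short sketch. The helper name is hypothetical and the conversion is
simplified; it assumes 0 < hz <= 1000000000 and a non-negative, already
validated timespec.

```c
#include <stdint.h>
#include <time.h>

/*
 * Convert a timespec to scheduler ticks.  The operands are widened to
 * uint64_t before the multiplication, so the arithmetic is unsigned
 * from the start and cannot hit signed-overflow undefined behaviour.
 */
static uint64_t
timespec_to_ticks_sketch(const struct timespec *ts, int hz)
{
	uint64_t nsec_per_tick = 1000000000ULL / (uint64_t)hz;
	uint64_t ticks;

	ticks = (uint64_t)ts->tv_sec * (uint64_t)hz;
	ticks += (uint64_t)ts->tv_nsec / nsec_per_tick;
	return ticks;
}
```

Unsigned arithmetic still wraps for absurdly large inputs, so a caller
would also need to clamp the result before handing it to a timeout.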


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
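
The pattern this commit describes can be sketched in standalone C. This is a single-threaded model, not the kernel's timeout code: `struct mock_timeout`, `mock_timeout_del` and the caller below are hypothetical stand-ins. It only illustrates why acting on the return value of the delete is preferable to a separate "is it pending?" check followed by a delete, which leaves a window between the two calls.

```c
#include <stdbool.h>

/* Stand-in for the kernel's struct timeout; only the pending bit is modeled. */
struct mock_timeout {
	bool pending;
};

/*
 * Models the contract described above: cancel the timeout and report
 * whether it was still pending (1) or not (0). In a real kernel the
 * test and the removal happen as one step under the timeout lock.
 */
static int
mock_timeout_del(struct mock_timeout *to)
{
	int was_pending = to->pending;

	to->pending = false;
	return was_pending;
}

/* Hypothetical caller: account for a cancellation only if we did cancel. */
static void
cancel_and_count(struct mock_timeout *to, int *ncancelled)
{
	if (mock_timeout_del(to))
		(*ncancelled)++;
}

/* Cancel the same timeout twice; only the first call should count. */
static int
demo_double_cancel(void)
{
	struct mock_timeout to = { true };
	int ncancelled = 0;

	cancel_and_count(&to, &ncancelled);
	cancel_and_count(&to, &ncancelled);
	return ncancelled;
}
```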


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
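
As a rough illustration of "round up and add one", here is a standalone C sketch. The function name and the microsecond units are assumptions, not the kernel's actual conversion code. The extra tick is there because the current tick is already partly elapsed, so waiting only the rounded-up count could return early.

```c
#include <stdint.h>

/*
 * Hypothetical conversion: `usec` is the requested sleep time and
 * `tick_usec` is the length of one clock tick, both in microseconds
 * (tick_usec must be nonzero). Round up to whole ticks, then add one
 * tick for the partially elapsed tick we are currently in.
 */
static uint64_t
usec_to_ticks(uint64_t usec, uint64_t tick_usec)
{
	return (usec + tick_usec - 1) / tick_usec + 1;
}
```

With a 10000 microsecond tick, a request for exactly one tick length yields 2 ticks, and a request one microsecond longer yields 3.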


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
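
The difference between a plain `p_flag |= bit` and an atomic bit operation can be sketched with C11 atomics. This is a userland model only: `model_setbits`, `model_clearbits` and the flag values are hypothetical names, standing in for the kernel's `atomic_setbits_int` and `atomic_clearbits_int`, whose implementations are machine dependent.

```c
#include <stdatomic.h>

#define MODEL_FLAG_A	0x01u	/* hypothetical flag bits */
#define MODEL_FLAG_B	0x02u

/* Set bits with a single atomic read-modify-write. */
static void
model_setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Clear bits with a single atomic read-modify-write. */
static void
model_clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Set both flags, clear one, and return what is left. */
static unsigned int
model_flag_demo(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0);
	model_setbits(&flags, MODEL_FLAG_A | MODEL_FLAG_B);
	model_clearbits(&flags, MODEL_FLAG_A);
	return atomic_load(&flags);
}
```

Because each update is one indivisible operation, an interrupt handler touching a different bit of the same word cannot have its change overwritten by a concurrent load/modify/store sequence.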


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
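
A toy model of the semantics described here, in standalone C. The array-based "sleep queue", the `model_` names and the returned count are assumptions for illustration; the kernel walks real sleep queues and its function signature may differ. Under this model, waking a single sleeper corresponds to calling the function with `n` equal to 1.

```c
#include <stddef.h>

/* One sleeping thread: the channel it waits on, or NULL once runnable. */
struct model_sleeper {
	const void *wchan;
};

/* Wake at most n sleepers waiting on chan; return how many were woken. */
static int
model_wakeup_n(struct model_sleeper *q, size_t len, const void *chan, int n)
{
	int woken = 0;
	size_t i;

	for (i = 0; i < len && woken < n; i++) {
		if (q[i].wchan == chan) {
			q[i].wchan = NULL;	/* mark runnable */
			woken++;
		}
	}
	return woken;
}

/* Three sleepers on channel A, one on channel B; wake up to n on A. */
static int
model_wakeup_demo(int n)
{
	static int chan_a, chan_b;
	struct model_sleeper q[] = {
		{ &chan_a }, { &chan_b }, { &chan_a }, { &chan_a },
	};

	return model_wakeup_n(q, sizeof(q) / sizeof(q[0]), &chan_a, n);
}
```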


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
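
The clipping rule from the "nice bug" paragraph above can be sketched as follows. Only the weight of two and the bound `NICE_WEIGHT * PRIO_MAX - PPQ` come from the text. The concrete values chosen for `PRIO_MAX` and `PPQ`, and all `MODEL_` names, are assumptions made so the example compiles on its own.

```c
#define MODEL_NICE_WEIGHT	2	/* "nice weight of two" from the text */
#define MODEL_PRIO_MAX		20	/* assumed: largest nice value */
#define MODEL_PPQ		4	/* assumed: priorities per run queue */

/* Upper bound on p_estcpu so that CPU-bound processes stay off the niced queues. */
#define MODEL_ESTCPU_MAX \
	(MODEL_NICE_WEIGHT * MODEL_PRIO_MAX - MODEL_PPQ)

/* Clip the CPU usage estimate to the bound instead of to UCHAR_MAX. */
static unsigned int
model_estcpu_clip(unsigned int estcpu)
{
	return estcpu > MODEL_ESTCPU_MAX ? MODEL_ESTCPU_MAX : estcpu;
}
```

Under these assumed constants the bound is 36, so an estimate of 255 is clipped to 36 while smaller values pass through unchanged.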


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.185 18-Mar-2022 bluhm

Cleanup reference counting. Remove #ifdef DIAGNOSTIC to keep the
code similar in non DIAGNOSTIC case. Rename refcnt variable to
refs for consistency with r_refs. Add KASSERT() in refcnt_finalize().
OK visa@


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
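
A userland model of the two helpers, using C11 atomics. The `model_` names and the struct layout are hypothetical and this is not the kernel implementation. It only mirrors the behaviour stated in the commit message: "shared" means more than one reference is held, and "read" returns a snapshot that may be stale as soon as it is returned.

```c
#include <stdatomic.h>

struct model_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
model_refcnt_init(struct model_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
model_refcnt_take(struct model_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Nonzero if someone besides the caller holds a reference. */
static int
model_refcnt_shared(struct model_refcnt *r)
{
	return atomic_load(&r->refs) > 1;
}

/* Snapshot of the counter; only safe for diagnostics or heuristics. */
static unsigned int
model_refcnt_read(struct model_refcnt *r)
{
	return atomic_load(&r->refs);
}

/* Take `extra` additional references, then report shared state. */
static int
model_shared_after(unsigned int extra)
{
	struct model_refcnt r;

	model_refcnt_init(&r);
	while (extra-- > 0)
		model_refcnt_take(&r);
	return model_refcnt_shared(&r);
}

/* Take `extra` additional references, then report the counter value. */
static unsigned int
model_read_after(unsigned int extra)
{
	struct model_refcnt r;

	model_refcnt_init(&r);
	while (extra-- > 0)
		model_refcnt_take(&r);
	return model_refcnt_read(&r);
}
```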


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

	tracepoint:sched:enqueue
	{
		@ts[arg0] = nsecs;
	}

	tracepoint:sched:on__cpu
	/@ts[tid]/
	{
		latency = nsecs - @ts[tid];
	}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
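
The clamping described above can be sketched in plain C. This is an
illustration only, not the committed code: the SKETCH_* constants mirror
the values quoted in this entry, and sketch_clamp_nsecs() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted in the log entry, redefined here for illustration. */
#define SKETCH_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: limit a caller-computed duration so it can
 * never be mistaken for the "sleep without timeout" sentinel.
 */
uint64_t
sketch_clamp_nsecs(uint64_t nsecs)
{
	if (nsecs > SKETCH_MAXTSLP)
		nsecs = SKETCH_MAXTSLP;
	assert(nsecs != SKETCH_INFSLP);
	return nsecs;
}
```

A caller would pass the clamped value on to the sleep function, so that
even a saturated duration still arms a timeout.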


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
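
The difference between the two comparisons can be checked with a small
userland sketch. SKETCH_TIMESPECCMP is a local stand-in for
timespeccmp(3) so the example builds where the macro is unavailable, and
sketch_timeout_elapsed() is a hypothetical name; neither is taken from
the kernel source.

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define SKETCH_TIMESPECCMP(a, b, cmp)				\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec cmp (b)->tv_nsec) :			\
	    ((a)->tv_sec cmp (b)->tv_sec))

/*
 * Hypothetical helper: true once the clock value *now has reached the
 * absolute timeout *tsp. Uses <= so that "now == timeout" counts as
 * elapsed.
 */
bool
sketch_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	return SKETCH_TIMESPECCMP(tsp, now, <=);
}
```

With a strict < in place of <=, the case where the clock equals the
timeout exactly would be reported as not yet elapsed.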


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
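
The error mapping described above might look like the following sketch.
It assumes the sleep primitive reports a restartable signal with an
internal ERESTART code; SKETCH_ERESTART and sketch_thrsleep_error() are
hypothetical names used only for this illustration.

```c
#include <errno.h>

/* Assumed stand-in for the kernel-internal "restart the syscall" code. */
#define SKETCH_ERESTART	(-1)

/*
 * Hypothetical helper: translate the result of an interrupted sleep
 * into the value handed back to userland. A restartable signal becomes
 * ECANCELED; every other value, including EINTR and 0, passes through.
 */
int
sketch_thrsleep_error(int sleep_error)
{
	if (sleep_error == SKETCH_ERESTART)
		return ECANCELED;
	return sleep_error;
}
```

Userland can then tell the two cases apart: ECANCELED means the
interrupting signal had SA_RESTART set, EINTR means it did not.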

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
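
The point about starting with the right type can be illustrated with a
short sketch. It is not the committed code: sketch_ts_to_ticks() is a
hypothetical helper, and it assumes a non-negative timespec and a tick
rate hz between 1 and 1000000000.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a non-negative timespec to a tick count
 * at hz ticks per second. Each operand is converted to uint64_t before
 * it takes part in any arithmetic, so no intermediate result is
 * computed in a narrower or signed type.
 */
uint64_t
sketch_ts_to_ticks(const struct timespec *ts, uint64_t hz)
{
	uint64_t nsec_per_tick;
	uint64_t ticks;

	assert(hz > 0 && hz <= UINT64_C(1000000000));
	nsec_per_tick = UINT64_C(1000000000) / hz;

	ticks = (uint64_t)ts->tv_sec * hz;
	ticks += (uint64_t)ts->tv_nsec / nsec_per_tick;
	return ticks;
}
```

Converting only the final result would be too late, because the
multiplication would already have been carried out in the original type.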


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
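
A minimal sketch of that conversion, assuming the timeout has already
been reduced to nanoseconds and that tick_nsec is the length of one tick
in nanoseconds. sketch_nsec_to_ticks() is a hypothetical helper, not the
code from this revision.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an interval in nanoseconds to a tick
 * count that covers the whole interval. Assumes the interval is far
 * below UINT64_MAX ticks, so the additions cannot wrap.
 */
uint64_t
sketch_nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	/* Round up so a partial tick is not lost to integer division. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);
	/* Add one for the tick that is already in progress. */
	return ticks + 1;
}
```

With 10 ms ticks, an interval of exactly one tick yields 2 and an
interval one nanosecond longer yields 3, so the sleep never ends early.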

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
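
The "nice bug" note above says that clipping an unsigned char to UCHAR_MAX
is a no-op. A minimal sketch of why, with hypothetical helper names; the
bound ESTCPU_MAX assumes NICE_WEIGHT = 2, PRIO_MAX = 20 and PPQ = 4, which
are illustrative values and not taken from this log.

```c
#include <limits.h>

/* Assumed stand-in for NICE_WEIGHT * PRIO_MAX - PPQ (2 * 20 - 4). */
#define ESTCPU_MAX 36u

/* The old clip: an unsigned char can never exceed UCHAR_MAX. */
unsigned char
clip_to_uchar_max(unsigned char estcpu)
{
	return (estcpu > UCHAR_MAX) ? UCHAR_MAX : estcpu;
}

/* A clip against a bound below UCHAR_MAX actually limits the value. */
unsigned char
clip_to_bound(unsigned char estcpu)
{
	return (estcpu > ESTCPU_MAX) ? ESTCPU_MAX : estcpu;
}
```

A compiler may warn that the first comparison is always false, which is
exactly the point being made.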


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.184 16-Mar-2022 visa

Add refcnt_shared() and refcnt_read()

refcnt_shared() checks whether the object has multiple references.
When refcnt_shared() returns zero, the caller is the only reference
holder.

refcnt_read() returns a snapshot of the counter value.

refcnt_shared() suggested by dlg@.

OK dlg@ mvs@
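
As a rough userland illustration of the two operations described above,
here is a sketch built on C11 atomics. The struct and function names carry
a _sketch suffix because they are hypothetical; this is not the kernel's
refcnt implementation.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a reference counter. */
struct refcnt_sketch {
	atomic_uint refs;
};

/* Return a snapshot of the counter; it may be stale by the time it is used. */
unsigned int
refcnt_read_sketch(struct refcnt_sketch *r)
{
	return atomic_load(&r->refs);
}

/* Return nonzero when more than one reference exists. */
int
refcnt_shared_sketch(struct refcnt_sketch *r)
{
	return atomic_load(&r->refs) > 1;
}
```

A zero result from the "shared" check is only meaningful to a caller that
itself holds a reference, since then nobody else can be holding one.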


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
	@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
	latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
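
The clamp described above can be sketched as follows. INFSLP and MAXTSLP
are defined locally with the values stated in this message;
clamp_sleep_nsecs() is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "do not set a timeout" */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Limit a requested duration so that it can never be mistaken for INFSLP. */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```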


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
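
To make the boundary case concrete, here is a small portable sketch of the
corrected comparison. deadline_elapsed() is a hypothetical helper that
spells out what a "<=" timespec comparison means; it is not the kernel
code.

```c
#include <stdbool.h>
#include <time.h>

/* True once the clock value `now` has reached the absolute `deadline`. */
bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	/* Equal seconds: "<=" counts the exact deadline instant as elapsed. */
	return deadline->tv_nsec <= now->tv_nsec;
}
```

With a strict "<" in the last line, a deadline of 1.000000000 would not be
reported as elapsed until the clock read 1.000000001.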


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow the lock to be released when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
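
One way to picture the "at least" guarantee is the sketch below. It is an
assumption-laden illustration, not the conversion algorithm committed here:
nsecs_to_ticks_atleast() and its tick_nsec parameter are hypothetical. It
rounds the division up and adds one tick to cover the partly elapsed
current tick.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a nanosecond interval into a tick count that covers at least
 * `nsecs`, saturating at INT_MAX.
 */
int
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round the division up */
	if (ticks >= INT_MAX)
		return INT_MAX;		/* saturate instead of overflowing */
	return (int)ticks + 1;		/* account for the current tick */
}
```

With a 10 ms tick, a 10 ms request yields 2 ticks: the sleep may start
anywhere inside the current tick, so a single tick could expire early.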


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities will now always
be < PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
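
The point of revision 1.133 can be illustrated with a small userland sketch.
It is not the kernel's code: secs_to_ticks_clamped() and its parameters are
hypothetical names chosen for this example. The idea shown is to move the
value into uint64_t before any multiplication happens and to clamp the
result, so that no signed overflow can occur.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a second count to clock ticks.
 * The arithmetic starts in uint64_t, so there is no signed overflow,
 * and the result is clamped to INT_MAX instead of wrapping.
 */
int
secs_to_ticks_clamped(long long secs, unsigned int hz)
{
	uint64_t usecs, ticks;

	/* Negative or zero input, or a zero rate, yields no ticks. */
	if (secs <= 0 || hz == 0)
		return 0;

	usecs = (uint64_t)secs;

	/* Check before multiplying so the product always fits. */
	if (usecs > (uint64_t)INT_MAX / hz)
		return INT_MAX;

	ticks = usecs * hz;
	return (int)ticks;
}
```

A caller that receives INT_MAX can treat it as "sleep as long as possible"
rather than seeing a negative timeout.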


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
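
The pattern described in revision 1.112 can be sketched with a toy model in
userland C. This is an assumption-laden illustration, not the kernel's
timeout(9) implementation: struct toy_timeout and the toy_* functions are
hypothetical. The delete operation clears the pending flag and reports the
previous value in one atomic step, so the caller branches on what the delete
actually did instead of on a separate, earlier check.

```c
#include <stdatomic.h>

/* Toy stand-in for a timeout: only tracks whether it is scheduled. */
struct toy_timeout {
	atomic_int pending;
};

void
toy_timeout_init(struct toy_timeout *t)
{
	atomic_init(&t->pending, 0);
}

void
toy_timeout_add(struct toy_timeout *t)
{
	atomic_store(&t->pending, 1);
}

/* Returns 1 if this call removed a pending timeout, 0 otherwise. */
int
toy_timeout_del(struct toy_timeout *t)
{
	return atomic_exchange(&t->pending, 0);
}

/*
 * Drop the reference held on behalf of the timeout only if we were
 * the one who cancelled it. No separate "is it pending?" test is
 * needed, so there is no window between the test and the delete.
 */
int
toy_cancel_and_release(struct toy_timeout *t, int *refs)
{
	if (toy_timeout_del(t)) {
		(*refs)--;
		return 1;
	}
	return 0;
}
```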


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
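
The rounding described in revision 1.102 can be sketched as follows. This is
a minimal userland illustration under stated assumptions, not the kernel's
code: timespec_to_ticks() is a hypothetical name, hz is assumed to be
nonzero and to divide one second evenly, and the timespec is assumed to be
non-negative.

```c
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a relative timeout to clock ticks.
 * Round up so that a partial tick still counts, then add one more
 * tick because the tick currently in progress is already partly used.
 */
uint64_t
timespec_to_ticks(const struct timespec *ts, unsigned int hz)
{
	uint64_t nsec_per_tick = 1000000000ULL / hz;
	uint64_t nsecs = (uint64_t)ts->tv_sec * 1000000000ULL +
	    (uint64_t)ts->tv_nsec;

	/* Round up to a whole number of ticks. */
	uint64_t ticks = (nsecs + nsec_per_tick - 1) / nsec_per_tick;

	/* Account for the tick we are already in the middle of. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could
return earlier than the requested duration.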


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
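
The clipping rule from the "nice bug" paragraph above can be sketched in a
few lines. The constant values below are assumptions made for illustration
(a nice weight of 2, PRIO_MAX of 20, and 4 priorities per queue); the TOY_
names and toy_estcpu_clamp() are hypothetical and are not the kernel's
definitions.

```c
/* Assumed values for illustration only. */
#define TOY_NICE_WEIGHT	2u	/* priority levels per nice step */
#define TOY_PRIO_MAX	20u	/* largest nice value */
#define TOY_PPQ		4u	/* priorities per run queue */

/* Upper bound quoted in the commit message. */
#define TOY_ESTCPU_LIMIT (TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ)

/*
 * Clip the CPU usage estimate so that a compute-bound process cannot
 * penalize itself onto the queue used by nice +20 processes.
 */
unsigned int
toy_estcpu_clamp(unsigned int estcpu)
{
	return estcpu < TOY_ESTCPU_LIMIT ? estcpu : TOY_ESTCPU_LIMIT;
}
```

With these assumed constants the limit is 36, well below UCHAR_MAX, which is
why clipping to UCHAR_MAX had no effect.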


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.183 10-Mar-2022 bluhm

Use atomic load and store functions to access refcnt and wait
variables. Although not necessary everywhere, using atomic functions
exclusively for variables marked as atomic is clearer.
OK mvs@ visa@


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
	@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
	latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
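
The clamp described above can be sketched in userland C as follows. This is
only an illustration under stated assumptions: the INFSLP and MAXTSLP values
are the ones quoted in this log entry, while clamp_sleep_nsecs() is a
hypothetical helper name and not a function from kern_synch.c.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "do not set a timeout", per the log */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest real timeout, per the log */

/*
 * Hypothetical helper: cap a caller-supplied duration so that it can
 * never equal INFSLP, which guarantees that a timeout is armed.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

Only the single value INFSLP is changed by the clamp; every shorter duration
passes through untouched.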


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
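
A small userland sketch of the corrected comparison, assuming nothing beyond
standard C. ts_le() is a hypothetical stand-in for timespeccmp(a, b, <=) and
deadline_elapsed() is an illustrative name; neither is taken from the kernel
source.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for timespeccmp(a, b, <=): true when a <= b. */
static bool
ts_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute deadline has elapsed as soon as the clock has reached it. */
static bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	return ts_le(deadline, now);
}
```

With a deadline of 1.000000000, the check is false at 0.999999999 and true at
exactly 1.000000000, which is the boundary case the strict comparison missed.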


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
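
The point about starting with the right type can be illustrated with a short
sketch. secs_to_ticks() is a hypothetical helper, not the kernel code: it
converts both operands to uint64_t before multiplying, so an overflow wraps
modulo 2^64 (defined behaviour) instead of being signed overflow (undefined).

```c
#include <stdint.h>

/*
 * Hypothetical helper: multiply in uint64_t from the start.  Casting the
 * result of a signed multiplication would be too late, because the signed
 * overflow has already happened by then.
 */
static uint64_t
secs_to_ticks(int64_t secs, int hz)
{
	return (uint64_t)secs * (uint64_t)hz;
}
```

A caller still has to decide what a wrapped result means; the sketch only
shows that the arithmetic itself stays well defined.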


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
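
A minimal sketch of the rounding rule, assuming the timeout is given in
nanoseconds and the tick length is known. nsecs_to_ticks() and its tick_nsec
parameter are hypothetical names for illustration, and the caller is assumed
to pass a nonzero tick length.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration to ticks, rounding up to a
 * whole tick and then adding one for the tick already in progress.
 */
static uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	if (nsecs % tick_nsec != 0)
		ticks++;		/* round a partial tick up */
	return ticks + 1;		/* the current tick is partly spent */
}
```

With a 10 ms tick, a 25 ms request becomes 4 ticks: 3 after rounding up, plus
one because an unknown part of the current tick has already gone by.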


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.
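
The idea can be sketched in userland with C11 atomics. This is only an
analogy: setbits() and clearbits() are hypothetical stand-ins for the
kernel's atomic_setbits_int and atomic_clearbits_int, and the flag values
are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this illustration. */
#define FLAG_A	0x01u
#define FLAG_B	0x04u

/* Atomically set the given bits in *flags. */
static void
setbits(atomic_uint *flags, unsigned int bits)
{
	assert(flags != NULL);
	atomic_fetch_or(flags, bits);
}

/* Atomically clear the given bits in *flags. */
static void
clearbits(atomic_uint *flags, unsigned int bits)
{
	assert(flags != NULL);
	atomic_fetch_and(flags, ~bits);
}

/* Set two flags, clear one, and return the resulting flag word. */
static unsigned int
flags_demo(void)
{
	atomic_uint flags = 0;

	setbits(&flags, FLAG_A | FLAG_B);
	clearbits(&flags, FLAG_A);
	return atomic_load(&flags);
}
```

A plain "flags |= bits" is a read-modify-write that an interrupt can split;
the atomic form performs the update as one indivisible step.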

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
  steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
  platform hz values don't cause nice +0 processes to look like they are
  niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
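
The no-op clip described above can be demonstrated with a small sketch. The
nice weight of two comes from the text; the values used for PRIO_MAX (20)
and PPQ (4) are assumptions for illustration only, and clip_old() and
clip_new() are hypothetical helper names.

```c
#include <assert.h>
#include <limits.h>

#define NICE_WEIGHT		2	/* "nice weight of two" from the text */
#define PRIO_MAX_ASSUMED	20	/* assumed value */
#define PPQ_ASSUMED		4	/* assumed priorities per queue */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX_ASSUMED - PPQ_ASSUMED)

static_assert(ESTCPU_LIMIT <= UCHAR_MAX, "limit must fit in unsigned char");

/* Old clip: an unsigned char can never exceed UCHAR_MAX, so this is a no-op. */
static unsigned char
clip_old(unsigned char estcpu)
{
	return estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu;
}

/* Clip against the limit named in the text instead. */
static unsigned char
clip_new(unsigned char estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit is 36, so clip_old(200) returns 200
unchanged while clip_new(200) returns 36.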

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.182 19-Feb-2022 deraadt

tsleep() prints a stack trace when cold==2. The suspend/resume code has
phases where sleeps are not allowed, and this used to discover it.
msleep() needs the same check.


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(); a contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
	@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
	latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

	sleep_setup();
	/* check condition or release lock */
	sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

	#include <sys/systm.h>

	tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

	int chan;

	tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

	tsleep
	vwaitforio
	nfs_flush
	nfs_close
	VOP_CLOSE
	vn_closefile
	fdrop
	closef
	fdcloseexec
	sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
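
A minimal sketch of the clamp, using the two values given above. The helper
name clamp_timeout() is hypothetical and this is not the committed code; it
only shows why a duration must be capped at MAXTSLP before being handed to a
function that treats INFSLP as "no timeout".

```c
#include <assert.h>
#include <stdint.h>

/* Values as stated in the commit message. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Cap a computed duration so it is never mistaken for INFSLP. */
static uint64_t
clamp_timeout(uint64_t nsecs)
{
	uint64_t clamped = nsecs < MAXTSLP ? nsecs : MAXTSLP;

	assert(clamped != INFSLP);
	return clamped;
}
```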


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

	timespeccmp(tsp, &now, <=)

and not

	timespeccmp(tsp, &now, <)

as it is currently.
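
The boundary case can be checked with a small userland sketch. The helper
timeout_elapsed() is a hypothetical name that spells out the "<=" comparison
by hand instead of using the timespeccmp macro, so that it builds on systems
whose headers do not provide that macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* True when the absolute timeout *tsp has been reached at time *now. */
static bool
timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);
	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	return tsp->tv_nsec <= now->tv_nsec;	/* "<=", not "<" */
}
```

With a timeout of 1.000000000, the function reports "elapsed" when the
clock reads exactly 1.000000000, which is the case the strict comparison
missed.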


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() so that a stopped
process does not hold a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
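
One way to picture the two hazards named above (the partly used current tick
and truncating integer division) is the sketch below. It is an assumption
made for illustration, not the conversion algorithm that was committed;
nsec_to_ticks() and its tick_nsec parameter are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Convert nanoseconds to ticks so the sleep is never shorter than nsecs. */
static int
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	ticks = nsecs / tick_nsec;
	if (ticks >= INT_MAX)
		return INT_MAX;		/* saturate before adding anything */
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round the division up */
	ticks++;			/* the current tick is partly used up */
	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```

With a 10 ms tick (tick_nsec of 10000000), a request for 1 ns yields 2
ticks: one for rounding up and one because the tick already in progress may
end at any moment.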

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
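
The rule in the paragraph above amounts to a two-way choice, sketched here
with a hypothetical helper. The name interrupted_errno() and its boolean
parameter are invented for the example; how the kernel learns whether
SA_RESTART applies is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Pick the error reported to userland for a sleep cut short by a signal. */
static int
interrupted_errno(bool sa_restart)
{
	int error = sa_restart ? ECANCELED : EINTR;

	assert(error != 0);
	return error;
}
```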

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
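
A minimal userland sketch of the idea above, not the code from this commit: do
the tick arithmetic in uint64_t from the first operand on, and clamp before the
multiplication can wrap. The function name timeout_to_ticks() and the clamp to
INT_MAX are assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a seconds/nanoseconds timeout into
 * clock ticks at rate hz, clamping the result to INT_MAX.
 */
int
timeout_to_ticks(uint64_t sec, uint64_t nsec, int hz)
{
	uint64_t ticks, tick_nsec;

	assert(hz > 0 && hz <= 1000000000);
	assert(nsec < 1000000000ULL);

	/* Length of one tick in nanoseconds. */
	tick_nsec = 1000000000ULL / (uint64_t)hz;

	/* Check before multiplying so that sec * hz cannot wrap. */
	if (sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* Every operand is already uint64_t, so no signed overflow. */
	ticks = sec * (uint64_t)hz + nsec / tick_nsec;
	if (ticks > (uint64_t)INT_MAX)
		return INT_MAX;
	return (int)ticks;
}
```

Starting with an unsigned 64-bit type matters because signed overflow is
undefined behaviour in C, so assigning an already overflowed signed result to
an unsigned variable afterwards does not help.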


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
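
The "atomic inc/dec" part of such a wrapper can be sketched in userland C11 as
below. This is an illustration only: the sketch_ names are hypothetical, the
sleeping done by refcnt_finalise is left out entirely, and the assert() calls
merely stand in for the kind of diagnostic checks a kernel would use.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference; the object must still be alive. */
void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);
}

/* Drop a reference; returns 1 if this was the last one. */
int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* catch underflow */
	return old == 1;
}
```

The caller that sees sketch_refcnt_rele() return 1 is the one responsible for
freeing the object or waking whoever is waiting for the last reference to go.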


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
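
Why the check-then-delete pattern is racy can be shown with a small userland
model. Everything here is hypothetical (struct sketch_timeout and both
functions are invented for the sketch and are not the timeout(9) API); the
point is only that the delete operation itself reports whether it removed
something, so no separate "pending" test is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of a timeout: 1 while queued, 0 otherwise. */
struct sketch_timeout {
	atomic_int pending;
};

/* Mark the timeout as queued. */
void
sketch_timeout_add(struct sketch_timeout *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, 1);
}

/*
 * Remove the timeout and report, in one atomic step, whether it was
 * still queued.  A separate "if (pending) del()" would leave a window
 * between the test and the removal in which the timeout could fire.
 */
int
sketch_timeout_del(struct sketch_timeout *to)
{
	assert(to != NULL);
	return atomic_exchange(&to->pending, 0);
}
```

Callers then branch on the return value of the delete, for example to decide
whether a reference held on behalf of the timeout still has to be released.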


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
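
The rounding rule described above can be sketched as follows. This is a
userland illustration under stated assumptions, not the kernel code: the
function name is hypothetical, hz is passed in as a parameter, and overflow
of tv_sec * hz is deliberately not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a relative timespec into ticks,
 * rounding any partial tick up and adding one extra tick for the
 * tick that is already in progress when the sleep starts.
 */
uint64_t
timespec_to_ticks_roundup(const struct timespec *ts, int hz)
{
	uint64_t tick_nsec, ticks;

	assert(hz > 0 && hz <= 1000000000);
	assert(ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

	/* Length of one tick in nanoseconds. */
	tick_nsec = 1000000000ULL / (uint64_t)hz;

	ticks = (uint64_t)ts->tv_sec * (uint64_t)hz;
	/* Round the sub-second part up to a whole tick. */
	ticks += ((uint64_t)ts->tv_nsec + tick_nsec - 1) / tick_nsec;
	/* The current tick is partly used up already. */
	return ticks + 1;
}
```

Without both adjustments a sleep can return early: truncation drops a partial
tick, and the first tick counted may end almost immediately.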


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
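
How a TAILQ-based sleep queue allows a single-pass wakeup can be sketched in
userland with <sys/queue.h>. This is a simplified illustration, not the code
from this commit: struct sleeper, struct sleepq and sketch_wakeup() are
hypothetical names, there is one queue instead of a hashed table, and no
locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a sleeping thread. */
struct sleeper {
	const volatile void *wchan;	/* wait channel, NULL if awake */
	int runnable;			/* set once woken up */
	TAILQ_ENTRY(sleeper) link;
};
TAILQ_HEAD(sleepq, sleeper);

/*
 * Wake every sleeper waiting on chan.  The queue is walked once and
 * each TAILQ_REMOVE() is O(1), so the whole operation is O(n).
 */
int
sketch_wakeup(struct sleepq *q, const volatile void *chan)
{
	struct sleeper *s, *next;
	int woken = 0;

	assert(q != NULL);
	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		/* Fetch the successor before s may be unlinked. */
		next = TAILQ_NEXT(s, link);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);
		s->wchan = NULL;
		s->runnable = 1;
		woken++;
	}
	return woken;
}
```

Because a TAILQ entry carries its own back pointer, removing an element does
not require searching for its predecessor.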


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.181 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already includes
the offset. This is necessary to build scripts comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
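
The clamp described above can be sketched in portable C. This is only an
illustration, not the kernel's code: the two constants use the values quoted
in the commit message, and clamp_sleep_nsecs() is a hypothetical helper name.

```c
#include <stdint.h>

/* Values as quoted in the commit message above. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: clamp a computed duration so that it can never
 * collide with the INFSLP sentinel, which would disable the timeout.
 */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

A caller that really wants an unbounded sleep would pass INFSLP directly
instead of going through such a helper.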


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
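
The boundary case can be illustrated outside the kernel. The sketch below
does not use timespeccmp(3); abs_timeout_elapsed() is a hypothetical helper
that spells out the same "less than or equal" comparison by hand.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper, not the kernel's code: returns true once the
 * absolute timeout *abs has been reached, i.e. when abs <= now.
 */
bool
abs_timeout_elapsed(const struct timespec *abs, const struct timespec *now)
{
	if (abs->tv_sec != now->tv_sec)
		return abs->tv_sec < now->tv_sec;
	/* Same second: equality counts as elapsed. */
	return abs->tv_nsec <= now->tv_nsec;
}
```

With a strict "<" in the last comparison, a timeout of 1.000000000 would
not be reported as elapsed while the clock reads exactly 1.000000000.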


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
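
The point about starting with the right type can be sketched as follows.
This is an assumption-laden illustration rather than the committed code:
timeout_to_ticks() is a hypothetical helper, the early saturation check is
an addition of this sketch, and it assumes sec >= 0, 0 <= nsec < 1000000000
and hz > 0.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds + nanoseconds into ticks at hz
 * ticks per second, saturating at INT_MAX.
 */
int
timeout_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t ticks;

	/* Saturate early so the multiplication below cannot wrap. */
	if (sec >= INT_MAX / hz)
		return INT_MAX;

	/*
	 * Cast the operands before the arithmetic, so the whole
	 * expression is evaluated in uint64_t. Casting only the result
	 * would be too late: the signed overflow has already happened.
	 */
	ticks = (uint64_t)sec * hz + ((uint64_t)nsec * hz) / 1000000000ULL;

	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```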


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
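
The "round up and add one" rule can be sketched in isolation. This is not
the committed code: nsecs_to_ticks() and its nsec_per_tick parameter are
hypothetical, and the sketch assumes nsec_per_tick > 1 so the final
increment cannot wrap.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a nanosecond interval into a tick count
 * that is guaranteed to cover it. nsec_per_tick is an assumed
 * parameter, e.g. 10000000 for a 100 Hz clock.
 */
uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	/* Round up so that a partial tick is not truncated away. */
	ticks = nsecs / nsec_per_tick;
	if (nsecs % nsec_per_tick != 0)
		ticks++;

	/* One more for the tick we are already in the middle of. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested just before a tick boundary
could end almost one full tick early.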


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@
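
A rough sketch of the "don't change p_rtime when time runs backwards" idea,
written as a standalone function. The name add_runtime() and the use of plain
microsecond counters instead of struct timeval are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the updated accumulated run time in microseconds.
 * If the clock appears to have gone backwards (now < start),
 * leave the total untouched instead of zeroing it.
 */
static int64_t
add_runtime(int64_t rtime_us, int64_t start_us, int64_t now_us)
{
	assert(rtime_us >= 0);
	if (now_us < start_us)
		return rtime_us;	/* non-monotonic clock: skip this update */
	return rtime_us + (now_us - start_us);
}
```

Skipping the update loses one interval of accounting but keeps the total
from ever shrinking.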


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
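
A toy model of the wakeup_n idea, assuming a singly linked sleep queue. The
struct sleeper type and wake_n() are hypothetical; the kernel works on struct
proc and its own queues. It only shows the "stop after n matches" behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sleeper record; the real kernel uses struct proc. */
struct sleeper {
	const void *chan;	/* wait channel the sleeper is blocked on */
	int awake;		/* set to 1 once the sleeper has been woken */
	struct sleeper *next;
};

/*
 * Wake at most n sleepers blocked on chan; n < 0 means "no limit".
 * Returns the number of sleepers woken.
 */
static int
wake_n(struct sleeper **head, const void *chan, int n)
{
	struct sleeper **pp = head;
	int woken = 0;

	assert(head != NULL);
	while (*pp != NULL && n != 0) {
		struct sleeper *s = *pp;

		if (s->chan == chan) {
			*pp = s->next;	/* unlink from the sleep queue */
			s->next = NULL;
			s->awake = 1;
			woken++;
			if (n > 0)
				n--;
		} else {
			pp = &s->next;	/* different channel: keep walking */
		}
	}
	return woken;
}
```

In this model wakeup_one corresponds to wake_n(&head, chan, 1) and a full
wakeup to wake_n(&head, chan, -1).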


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
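
A small sketch of the clipping rule above. The nice weight of two comes from
the text; the values used here for PRIO_MAX (20) and PPQ (4) and the names
ESTCPU_LIMIT and clip_estcpu() are assumptions for illustration.

```c
#include <assert.h>

#define NICE_WEIGHT	2	/* from the text: nice weight of two */
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

/* Highest p_estcpu that still stays one queue below a nice +20 process. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

static unsigned int
clip_estcpu(unsigned int estcpu)
{
	/* A clip against UCHAR_MAX would never fire for an unsigned char. */
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit is 36, exactly one queue width (PPQ)
short of the NICE_WEIGHT * PRIO_MAX penalty carried by a nice +20 process.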

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.180 07-Oct-2021 mpi

Remove the assertion that `curproc' must be SONPROC if found on the sleepqueue.

If `curproc' finds itself on the sleepqueue inside wakeup(9) it is obviously
being executed. Such wakeup(9) currently happens inside the critical section
of the SCHED_LOCK(), generally before cpu_switchto(). However `p_stat' is
changed many operations before cpu_switchto() and the KASSERT() isn't helpful
at catching real bugs.

One example of this is a call to rwsleep() that calls wakeup() via rw_exit()
before sleep_finish(), contended futex(2) triggers that a lot.

Another example are dt(4)'s scheduler TRACEPOINT() in setrunqueue() and
mi_switch().

Suggested by and ok kettenis@


Revision tags: OPENBSD_7_0_BASE
# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build scripts comparing them, like:

    tracepoint:sched:enqueue
    {
        @ts[arg0] = nsecs;
    }

    tracepoint:sched:on__cpu
    /@ts[tid]/
    {
        latency = nsecs - @ts[tid];
    }

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

    sleep_setup();
    /* check condition or release lock */
    sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
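
A minimal sketch of the clamp described above. The INFSLP and MAXTSLP values
are the ones given in this message; the helper name clamp_sleep_nsecs() is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets a timeout */

/*
 * Limit a caller-supplied duration so that it can never be mistaken
 * for the INFSLP sentinel.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	uint64_t clamped = nsecs > MAXTSLP ? MAXTSLP : nsecs;

	assert(clamped != INFSLP);
	return clamped;
}
```

A caller that really wants an untimed sleep would pass INFSLP explicitly
rather than going through this helper.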


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
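
The boundary case can be checked with a small standalone sketch. TS_CMP() is
a local macro modelled on timespeccmp(3), and timeout_elapsed() is a
hypothetical helper; neither is the kernel code itself.

```c
#include <assert.h>
#include <time.h>

/* Local stand-in for timespeccmp(3): compare seconds, then nanoseconds. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* An absolute deadline has elapsed once the clock has reached it. */
static int
timeout_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	assert(deadline != NULL && now != NULL);
	return TS_CMP(deadline, now, <=);
}
```

With "<" instead of "<=" the middle case, where the clock equals the
deadline exactly, would be reported as not yet elapsed.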


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc(), a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
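
One way to picture the "at least" conversion is the sketch below. It is not
the committed algorithm: it simply rounds the interval up to whole ticks and
adds one more tick to cover the partially elapsed current tick. The name
nsecs_to_ticks_atleast() and the tick_nsec parameter are assumptions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a duration in nanoseconds to a tick count that is guaranteed
 * to span at least that duration, saturating at INT_MAX.
 */
static int
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round a partial tick up */
	if (ticks >= INT_MAX)
		return INT_MAX;		/* saturate instead of overflowing */
	return (int)ticks + 1;		/* current tick is already under way */
}
```

With a 10 ms tick, a 1 ns request yields 2 ticks: one for the rounded-up
interval and one because the sleep may start just before a tick boundary.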

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
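
The return-value rule described above could be sketched as follows. The
constant TOY_ERESTART is a hypothetical stand-in for the kernel-internal
"restart the syscall" code, and thrsleep_signal_error() is an illustrative
name, not the function in the tree.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel-internal restart error code. */
#define TOY_ERESTART	(-1)

/*
 * Map the error from an interrupted sleep to what userland sees:
 * ECANCELED when the signal was restartable (SA_RESTART), otherwise
 * the original error such as EINTR.
 */
static int
thrsleep_signal_error(int sleep_error)
{
	if (sleep_error == TOY_ERESTART)
		return ECANCELED;
	return sleep_error;
}
```

This lets the caller tell apart "interrupted, retry is fine" from
"interrupted, do not restart".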


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
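
A small sketch of the point about starting with the right type. The
function names are hypothetical. The narrow version uses unsigned
32-bit arithmetic so that the wraparound is well defined for the
demonstration; with signed operands the same overflow would be
undefined behaviour, which is the case the commit message warns about.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplies in 32 bits first: the product wraps before it is widened. */
uint64_t
ticks_narrow(uint32_t sec, uint32_t hz)
{
    uint32_t product = sec * hz;    /* wraps modulo 2^32 */

    return product;
}

/* Widens one operand first: the multiplication is done in 64 bits. */
uint64_t
ticks_wide(uint32_t sec, uint32_t hz)
{
    return (uint64_t)sec * hz;
}
```

Casting the result after the multiplication cannot recover the bits
that were already lost; only widening an operand beforehand does.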


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
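
A userland sketch of the "atomic inc/dec" part of such a wrapper,
using C11 atomics. All toy_* names are hypothetical and are not the
kernel API. The sleeping done by refcnt_finalise is deliberately left
out; the asserts stand in for overflow and underflow diagnostics.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

struct toy_refcnt {
    atomic_uint refs;
};

/* The creator of the object holds the first reference. */
void
toy_refcnt_init(struct toy_refcnt *r)
{
    atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
void
toy_refcnt_take(struct toy_refcnt *r)
{
    unsigned int old = atomic_fetch_add(&r->refs, 1);

    assert(old != 0);           /* object was already released */
    assert(old != UINT_MAX);    /* counter would overflow */
}

/* Drop a reference; returns 1 if it was the last one, else 0. */
int
toy_refcnt_rele(struct toy_refcnt *r)
{
    unsigned int old = atomic_fetch_sub(&r->refs, 1);

    assert(old != 0);           /* counter would underflow */
    return old == 1;
}
```

The caller that sees toy_refcnt_rele() return 1 is the one allowed to
free the object.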


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
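
A toy illustration of why the check-then-delete pair is racy and the
single call is not. The toy_timeout type and functions are
hypothetical and only model the "pending" bit with a C11 atomic; they
are not the timeout(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_timeout {
    atomic_int pending;
};

void
toy_timeout_init(struct toy_timeout *to)
{
    atomic_init(&to->pending, 0);
}

void
toy_timeout_add(struct toy_timeout *to)
{
    atomic_store(&to->pending, 1);
}

/*
 * Cancel the timeout and report, in one atomic step, whether it was
 * still pending: returns 1 if this call removed it, 0 otherwise.
 */
int
toy_timeout_del(struct toy_timeout *to)
{
    return atomic_exchange(&to->pending, 0);
}
```

With a separate "is it pending?" test, the timeout could fire between
the test and the delete; using the return value of the delete itself
leaves no such window.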


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
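
A minimal sketch of the conversion described above, assuming a
hypothetical helper nsec_to_ticks() and a tick length given in
nanoseconds; the kernel's own code and units may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to a
 * tick count. Round up so a partial tick is not lost, then add one for
 * the tick that is already in progress.
 */
uint64_t
nsec_to_ticks(uint64_t nsec, uint64_t tick_nsec)
{
    uint64_t ticks;

    assert(tick_nsec > 0);

    ticks = nsec / tick_nsec;
    if (nsec % tick_nsec != 0)
        ticks++;        /* round up */
    return ticks + 1;   /* current, partially elapsed tick */
}
```

Without the extra tick, a sleep requested late in the current tick
could end almost a full tick early.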


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
More thorough test runs, by more people and over a longer time,
have been carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
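
A sketch of the clipping argument from the "nice bug" paragraph. The
TOY_* constants are assumptions for illustration: a nice weight of two
and a maximum nice of 20 follow the text above, while the PPQ value of
4 is assumed and not taken from this log.

```c
#include <assert.h>
#include <limits.h>

enum {
    TOY_NICE_WEIGHT = 2,    /* "nice weight of two" */
    TOY_PRIO_MAX = 20,      /* nice +20 */
    TOY_PPQ = 4             /* priorities per queue (assumed) */
};

/* Clipping an unsigned char to UCHAR_MAX can never change the value. */
unsigned char
clip_to_uchar_max(unsigned char estcpu)
{
    return estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu;
}

/* Clip to NICE_WEIGHT * PRIO_MAX - PPQ, as the text says it should be. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
    const unsigned int limit =
        TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ;

    return estcpu > limit ? limit : estcpu;
}
```

Under these assumed constants the limit is 36, so a compute-bound
process cannot accumulate enough p_estcpu to land on the queue used by
nice +20 processes.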


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.179 09-Sep-2021 mpi

Add THREAD_PID_OFFSET to tracepoint arguments that pass a TID to userland.

Bring these values in sync with the `tid' builtin which already include
the offset. This is necessary to build script comparing them, like:

tracepoint:sched:enqueue
{
@ts[arg0] = nsecs;
}

tracepoint:sched:on__cpu
/@ts[tid]/
{
latency = nsecs - @ts[tid];
}

Discussed with and ok bluhm@


# 1.178 09-Sep-2021 mpi

Move a check to avoid panicking on contended rwlock(9) outside of DIAGNOSTIC.

ok kettenis@


Revision tags: OPENBSD_6_9_BASE
# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
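
The clamp described above can be sketched roughly as follows. INFSLP and
MAXTSLP use the values given in the message; the saturating conversion and
both helper names are assumptions made for illustration, not the kernel's
code:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define INFSLP	UINT64_MAX		/* "do not set a timeout" sentinel */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/* Hypothetical helper: timespec to nanoseconds, saturating at UINT64_MAX. */
uint64_t
timespec_to_nsec_sat(const struct timespec *ts)
{
	uint64_t sec, nsec;

	assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0 &&
	    ts->tv_nsec < 1000000000L);
	sec = (uint64_t)ts->tv_sec;
	nsec = (uint64_t)ts->tv_nsec;
	if (sec > (UINT64_MAX - nsec) / 1000000000ULL)
		return UINT64_MAX;
	return sec * 1000000000ULL + nsec;
}

/*
 * Hypothetical helper: a caller that always wants a timeout must never
 * pass INFSLP, so a saturated value is pulled back to MAXTSLP.
 */
uint64_t
bounded_sleep_nsecs(const struct timespec *ts)
{
	uint64_t nsecs = timespec_to_nsec_sat(ts);

	return (nsecs == INFSLP) ? MAXTSLP : nsecs;
}
```

Without the clamp, a very large user-supplied interval would saturate to
the sentinel and silently turn into a sleep with no timeout at all.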


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
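
A small userland sketch of the difference between the two comparisons. The
TS_CMP macro is a local stand-in with the same shape as timespeccmp(3), and
the two function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3) so the sketch builds anywhere. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* Fixed check: the deadline has elapsed once the clock has reached it. */
bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	assert(deadline->tv_nsec >= 0 && deadline->tv_nsec < 1000000000L);
	return TS_CMP(deadline, now, <=);
}

/* Old check, kept for contrast: misses the instant where now == deadline. */
bool
deadline_elapsed_strict(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <);
}
```

With a deadline of 1.000000000 and a clock reading of exactly 1.000000000,
deadline_elapsed() reports true while deadline_elapsed_strict() still
reports false.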


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
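
The error translation described above might look roughly like this. The
sketch assumes the sleep primitive reports a restartable interruption with
a distinct internal code; SKETCH_ERESTART and the function name are
hypothetical stand-ins, not the kernel's definitions:

```c
#include <errno.h>

/* Assumption: stand-in for the kernel-internal "restart syscall" code. */
#define SKETCH_ERESTART	(-1)

/*
 * Translate the result of an interrupted sleep into what userland sees:
 * ECANCELED when the signal handler had SA_RESTART set, and the error
 * unchanged (e.g. EINTR) otherwise.
 */
int
sleep_error_for_user(int error)
{
	if (error == SKETCH_ERESTART)
		return ECANCELED;
	return error;
}
```

The caller can then tell from the return value alone whether it is
expected to retry the call.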

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
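
A minimal sketch of the "start with the right type" idea: the operands are
converted to uint64_t before any arithmetic, so no signed multiplication
can overflow. The function name, the parameter layout, and the final clamp
to INT_MAX are assumptions for illustration only:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: convert a validated interval to clock ticks. */
uint64_t
timeout_to_ticks(int64_t sec, int64_t nsec, int hz)
{
	uint64_t ticks;

	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000 &&
	    hz > 0 && hz <= 1000000000);

	/* Reject values whose tick count cannot fit before multiplying. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* Every operand is already uint64_t when the arithmetic happens. */
	ticks = (uint64_t)sec * (uint64_t)hz +
	    (uint64_t)nsec / (1000000000ULL / (uint64_t)hz);
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return ticks;
}
```

Assigning the result of a signed computation to an unsigned variable would
not help, because the overflow would already have happened in the signed
expression.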


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.
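
The "atomic inc/dec" half of such a wrapper can be sketched in portable
C11 as below. All names are hypothetical, and the sleeping part performed
by refcnt_finalise is deliberately left out because it depends on the
kernel sleep machinery:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct refcnt_sketch {
	atomic_uint refs;
};

/* The creator holds the first reference. */
void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns true when the last one has just gone away. */
bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* releasing more than was taken is a bug */
	return old == 1;
}
```

The caller that sees true from the release function is the one
responsible for freeing the object.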

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
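
The rounding rule above can be sketched as follows. The function name and
the nanosecond-based parameters are assumptions for illustration; the
commit itself only states the rule:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an interval to ticks so that the sleep
 * lasts at least as long as requested. tick_nsec is the tick length in
 * nanoseconds.
 */
uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 1);	/* keeps the additions below from wrapping */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;	/* round a partial tick up */
	return ticks + 1;	/* the current tick is already partly used */
}
```

With a 10 ms tick, a 25 ms request becomes 4 ticks: 3 after rounding up,
plus 1 for the tick already in progress.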

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)
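
As an illustration of the last point, a TAILQ-based sleep queue lets a
wakeup finish in a single pass, because removing an entry does not require
rescanning the list. This is a userland sketch using the sys/queue.h
macros; the structure and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sleeper {
	const void *wchan;		/* channel this thread sleeps on */
	int runnable;			/* set once it has been woken */
	TAILQ_ENTRY(sleeper) link;
};
TAILQ_HEAD(sleepq, sleeper);

/* Put a sleeper at the tail of the queue. */
void
sleepq_insert(struct sleepq *q, struct sleeper *s, const void *wchan)
{
	assert(s != NULL);
	s->wchan = wchan;
	s->runnable = 0;
	TAILQ_INSERT_TAIL(q, s, link);
}

/* Wake every sleeper on wchan in one pass; returns how many were woken. */
int
sleepq_wakeup(struct sleepq *q, const void *wchan)
{
	struct sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking s */
		if (s->wchan != wchan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* constant-time unlink */
		s->wchan = NULL;
		s->runnable = 1;
		n++;
	}
	return n;
}
```

The successor is read before the current entry is unlinked, which keeps
the traversal valid while entries are being removed.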

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
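
The clipping rule from the "nice bug" paragraph above can be sketched as
follows. This is only an illustration: the entry gives the nice weight as
two, but the values of PRIO_MAX and PPQ used here are assumptions, and
clip_estcpu() is a hypothetical helper name, not a function from the commit.

```c
/* Assumed values for illustration; only the nice weight of 2 is in the log. */
#define NICE_WEIGHT	2	/* priority levels per nice level */
#define PRIO_MAX	20	/* assumed largest nice value */
#define PPQ		4	/* assumed priorities per run queue */

/* Upper bound named in the log entry: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define ESTCPU_LIMIT	((unsigned int)(NICE_WEIGHT * PRIO_MAX - PPQ))

/*
 * Hypothetical helper: clip the decayed CPU estimate so that a
 * compute-bound process cannot penalize itself onto the queues
 * reserved for niced processes.
 */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu > ESTCPU_LIMIT) ? ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit is 36, so clipping does real work,
unlike clipping an unsigned char to UCHAR_MAX.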


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.177 04-Mar-2021 mpi

Merge issignal() and CURSIG() in preparation for turning it mp-safe.

This exposes some redundant & racy checks.

ok semarie@


# 1.176 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
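
A minimal sketch of the clamp described above, using the INFSLP and MAXTSLP
values given in the entry. The helper name clamp_sleep_nsec() is
hypothetical and is not taken from the commit.

```c
#include <stdint.h>

/* Values as stated in the log entry. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper: a caller that always wants a timeout must never
 * pass INFSLP, so a computed duration is limited to MAXTSLP.
 */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```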


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
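
The difference between the two comparisons can be shown in portable C. This
sketch avoids the timespeccmp(3) macro so it builds outside BSD;
deadline_elapsed() is a hypothetical helper, not code from the commit.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: true once the clock has reached the absolute
 * deadline, that is, when deadline <= now.
 */
static bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	/* Same second: "<=" so that deadline == now counts as elapsed. */
	return deadline->tv_nsec <= now->tv_nsec;
}
```

With a strict "<" in the last comparison, a deadline equal to the current
time would be reported as not yet elapsed, which is the off-by-one the
entry describes.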


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
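
One way to meet the "at least" guarantee described above is to round the
interval up to whole ticks and then add one more tick, since the current
tick is already partly used up. The sketch below is an assumption made for
illustration and is not the conversion algorithm committed here; the name
nsec_to_ticks_atleast() and the tick_nsec parameter (tick length in
nanoseconds, which must be nonzero) are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical conversion: return a tick count that covers at least
 * nsecs nanoseconds, clamped to INT_MAX. tick_nsec must be nonzero.
 */
static int
nsec_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round up without overflowing */
	if (ticks >= INT_MAX)
		return INT_MAX;		/* clamp instead of wrapping */
	return (int)ticks + 1;		/* account for the partial current tick */
}
```

With a 10 ms tick, a request of 1 ns yields 2 ticks: one for the rounded-up
interval and one for the tick already in progress.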


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
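
The point about starting with the right type can be sketched as below: the
operands are widened to uint64_t before the multiplication, and the result
is clamped rather than allowed to wrap. The helper timespec_to_ticks() and
its parameters are hypothetical and do not reproduce the code changed by
this commit.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds plus nanoseconds to clock ticks
 * at rate hz. Invalid input yields 0; large results clamp to INT_MAX.
 */
static int
timespec_to_ticks(long sec, long nsec, int hz)
{
	uint64_t ticks;

	if (sec < 0 || nsec < 0 || nsec >= 1000000000L || hz <= 0)
		return 0;
	/* Reject values whose product with hz would exceed INT_MAX. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;
	/* Widen first so every product is computed in 64 bits. */
	ticks = (uint64_t)sec * (uint64_t)hz;
	ticks += (uint64_t)nsec * (uint64_t)hz / 1000000000ULL;
	return (ticks > INT_MAX) ? INT_MAX : (int)ticks;
}
```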


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
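
The "atomic inc/dec" part of such a wrapper can be sketched in userland with C11 atomics. The names below (refcnt_sketch and friends) are hypothetical and only mirror the general shape described in the message; the sleeping finalise step built on sleep_setup is deliberately left out because it has no direct userland equivalent.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct refcnt_sketch {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
static void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference; the object must still be alive. */
static void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* taking a reference on a dead object */
	assert(old != UINT_MAX);	/* counter overflow */
}

/* Drop a reference; returns 1 if this was the last one, else 0. */
static int
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* counter underflow */
	return old == 1;
}
```

A caller would typically free the object when the release function reports that the last reference is gone.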


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
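
The rounding rule described above can be sketched as follows. The function name and the nanosecond-based interface are assumptions for illustration only; the real conversion in the kernel may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to a
 * tick count, where one tick lasts tick_nsec nanoseconds.
 */
static uint64_t
nsec_to_ticks_sketch(uint64_t nsec, uint64_t tick_nsec)
{
	assert(tick_nsec > 0);
	assert(nsec <= UINT64_MAX - tick_nsec);	/* keep the sum in range */

	/*
	 * Round a partial tick up, then add one more tick because the
	 * current tick is already partly consumed and must not count.
	 */
	return (nsec + tick_nsec - 1) / tick_nsec + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could end earlier than the caller asked for.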


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
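
The "wake up to n sleepers" behaviour can be sketched with a simple linked list in userland. Everything below (struct sleeper, wakeup_n_sketch) is a hypothetical model for illustration; the kernel's sleep queues, locking, and scheduling are not represented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a sleeping thread on a sleep queue. */
struct sleeper {
	const volatile void *wchan;	/* channel slept on, NULL once woken */
	struct sleeper *next;
};

/*
 * Wake at most n sleepers waiting on chan, in queue order.
 * Returns the number of sleepers actually woken.
 */
static int
wakeup_n_sketch(struct sleeper **head, const volatile void *chan, int n)
{
	struct sleeper **pp = head;
	int woken = 0;

	assert(n >= 0);
	while (*pp != NULL && woken < n) {
		struct sleeper *s = *pp;

		if (s->wchan == chan) {
			*pp = s->next;		/* unlink from the queue */
			s->next = NULL;
			s->wchan = NULL;	/* mark as no longer sleeping */
			woken++;
		} else {
			pp = &s->next;		/* different channel, skip */
		}
	}
	return woken;
}
```

In this model, waking exactly one sleeper is the n = 1 case, and passing a very large n wakes every sleeper on the channel.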


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
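
The clipping rule from the "nice bug" paragraph above can be sketched as follows. The nice weight of two comes from the message itself; the values PRIO_MAX = 20 and PPQ = 4 are assumptions chosen for illustration, and the function name is hypothetical.

```c
#include <assert.h>

/* Assumed constants for illustration; only NICE_WEIGHT is from the text. */
enum {
	NICE_WEIGHT = 2,	/* priority levels per nice step */
	PRIO_MAX = 20,		/* largest nice value (assumption) */
	PPQ = 4			/* priorities per run queue (assumption) */
};

static_assert(NICE_WEIGHT * PRIO_MAX > PPQ, "limit must be positive");

/*
 * Clip the CPU usage estimate so that a compute-bound process cannot
 * penalize itself onto the same queue as a nice +20 process.
 */
static unsigned int
estcpu_clip_sketch(unsigned int estcpu)
{
	const unsigned int limit = NICE_WEIGHT * PRIO_MAX - PPQ;

	return estcpu < limit ? estcpu : limit;
}
```

By contrast, clipping a value stored in an unsigned char to UCHAR_MAX can never change it, which is why the message calls the old clip a no-op.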


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.176 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.175 08-Feb-2021 mpi

Simplify sleep_setup API to two operations in preparation for splitting
the SCHED_LOCK().

Putting a thread on a sleep queue is reduced to the following:

sleep_setup();
/* check condition or release lock */
sleep_finish();

Previous version ok cheloha@, jmatthew@, ok claudio@


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
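
The clamp described above can be sketched in a few lines of standalone C.
This is an illustration only, not the code from the commit: the SK_ names
and clamp_sleep_nsecs() are hypothetical, and the two constants are simply
redefined locally with the values stated in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the values quoted in the commit message. */
#define SK_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define SK_MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: ensure a computed duration never collides with
 * the sentinel, so that a timeout is always armed.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	if (nsecs > SK_MAXTSLP)
		nsecs = SK_MAXTSLP;
	assert(nsecs != SK_INFSLP);
	return nsecs;
}
```

A caller that wants a bounded sleep would pass its computed duration
through such a helper before handing it to the sleep routine.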


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
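
A small standalone sketch of the boundary case, under the assumption that
a local comparison macro stands in for timespeccmp(3) so the example
builds outside the kernel. TS_CMP and abs_timeout_elapsed() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Local stand-in for timespeccmp(3): compare seconds first and fall
 * back to nanoseconds only when the seconds are equal.
 */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* Hypothetical helper: has the absolute timeout *tsp elapsed at *now? */
static bool
abs_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	/* "<=": the timeout has elapsed as soon as the clock reaches T. */
	return TS_CMP(tsp, now, <=);
}
```

With T = 1.000000000 the helper reports "elapsed" when the clock reads
exactly 1.000000000, which is the case the strict "<" comparison missed.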


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
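
To illustrate the point about starting with the right type, here is a
standalone sketch of a timespec-to-ticks conversion. It is not the code
touched by this commit; sk_timespec_to_ticks() and its hz parameter are
assumptions made for the example. Signed overflow is undefined in C,
while unsigned arithmetic wraps in a defined way, so the operands are
cast to uint64_t before the multiplication rather than after it.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a non-negative timespec to clock ticks,
 * given hz ticks per second. The sub-tick remainder is truncated.
 */
static uint64_t
sk_timespec_to_ticks(const struct timespec *ts, int hz)
{
	assert(hz > 0 && hz <= 1000000000);
	assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0);

	uint64_t nsec_per_tick = UINT64_C(1000000000) / (uint64_t)hz;

	/* Cast first, so the multiply itself is done in uint64_t. */
	return (uint64_t)hz * (uint64_t)ts->tv_sec +
	    (uint64_t)ts->tv_nsec / nsec_per_tick;
}
```

Assigning the result of a signed multiplication to an unsigned variable
would not help, because the overflow would already have happened in the
signed expression.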


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
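
The "atomic inc/dec" half of such a wrapper can be sketched in userland
with C11 atomics. This is only an illustration under stated assumptions:
the sk_ names are hypothetical, this is not the kernel's refcnt
implementation, and the sleeping behaviour of refcnt_finalise is left out
entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland counterpart of a reference count wrapper. */
struct sk_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
sk_refcnt_init(struct sk_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take another reference; the object must still be alive. */
static void
sk_refcnt_take(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);
	assert(old != 0);
}

/* Drop a reference; returns true if it was the last one. */
static bool
sk_refcnt_rele(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);
	assert(old != 0);	/* catch underflow */
	return old == 1;
}
```

The caller that sees true from the release function is the one
responsible for tearing the object down.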


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
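
The rounding rule can be sketched as a standalone function. The name
nsecs_to_ticks() and the nanosecond-based interface are assumptions made
for the example; the code changed by this commit is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration in nanoseconds to a tick
 * count that is guaranteed to cover at least that duration.
 */
static uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	assert(tick_nsec > 0);

	uint64_t ticks = nsecs / tick_nsec;

	if (nsecs % tick_nsec != 0)
		ticks++;		/* round a partial tick up */
	if (ticks < UINT64_MAX)
		ticks++;		/* the current tick is already underway */
	return ticks;
}
```

With a 10 ms tick, a request for exactly 10 ms yields two ticks: one for
the interval itself and one for the tick that is already partly consumed
when the sleep starts.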


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.174 11-Jan-2021 claudio

Simplify sleep signal handling a bit by introducing sleep_signal_check().
The common code is moved to sleep_signal_check() and instead of multiple
state variables for sls_sig and sls_unwind only one sls_sigerr is set.
This simplifies the checks in sleep_finish_signal() a great bit.
Idea from and OK mpi@


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned, causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
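
A minimal userland sketch of the clamp described above, assuming INFSLP
and MAXTSLP have the values stated in this message. The helper name
sketch_clamp_nsecs() is hypothetical and is not the kernel's code:

```c
#include <stdint.h>

/* Values as stated in the commit message; redefined here for the sketch. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest duration that still sets one */

/* Hypothetical helper: make sure a computed duration never equals INFSLP. */
static uint64_t
sketch_clamp_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

Only the sentinel value itself is changed; every shorter duration passes
through untouched.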


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
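
The boundary case can be checked in userland with a small sketch.
sketch_timeout_elapsed() is a hypothetical stand-in that spells out the
"<=" comparison by hand, so it does not depend on the timespeccmp macro
being available:

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: true when the absolute timeout *tsp has been
 * reached at time *now, i.e. when *tsp <= *now.
 */
static bool
sketch_timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	return tsp->tv_nsec <= now->tv_nsec;
}
```

With a timeout of 1.000000000, the helper reports "elapsed" at exactly
1.000000000 rather than only from 1.000000001 onward.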


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
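
The committed conversion is not reproduced in this log. As a rough
illustration of why both the integer division and the current tick
matter, here is a sketch assuming a fixed tick length in nanoseconds.
The function name and the saturation behaviour are hypothetical, not
the kernel's algorithm:

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert nsecs to a tick count that sleeps for at
 * least nsecs.  tick_nsec must be nonzero.
 */
static uint64_t
sketch_nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	/* Round up so integer division never shortens the sleep. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;
	/* One extra tick covers the partially elapsed current tick. */
	if (ticks < UINT64_MAX)
		ticks++;
	return ticks;
}
```

With a 10 ms tick, a request of exactly 10 ms becomes two ticks, because
the first tick may expire almost immediately.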


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
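
A minimal sketch of the idea, assuming 32-bit inputs so that the widened
product always fits in 64 bits. sketch_secs_to_ticks() is a hypothetical
name and this is not the code that was committed:

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen to uint64_t before multiplying, then clamp
 * to the range of the int tick count.
 */
static int
sketch_secs_to_ticks(uint32_t secs, uint32_t hz)
{
	/* Two 32-bit operands cannot overflow a 64-bit unsigned product. */
	uint64_t to_ticks = (uint64_t)secs * (uint64_t)hz;

	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

The point is that the arithmetic starts in the wide unsigned type, so the
clamp sees the true product instead of an already overflowed value.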


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
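
The "atomic inc/dec" half of the wrapper can be sketched in userland
with C11 atomics. The sketch_ names are hypothetical, and the sleeping
part of refcnt_finalise is deliberately left out since it depends on the
kernel sleep machinery:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct sketch_refcnt {
	atomic_uint refs;
};

static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds the first reference */
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Returns true when the caller dropped the last reference. */
static bool
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	/* atomic_fetch_sub returns the value held before the decrement. */
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

A caller that sees true from the release function is the one responsible
for tearing the object down.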


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
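
The rounding rule in 1.102 can be sketched in userland C as follows. This
is only an illustration under stated assumptions: nsec_to_ticks() and its
nsec_per_tick parameter are hypothetical names, not the code that was
committed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to
 * clock ticks.  nsec_per_tick stands in for 1000000000 / hz.
 */
uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t nsec_per_tick)
{
	/* Round up so that a partial tick counts as a whole tick. */
	uint64_t ticks = nsecs / nsec_per_tick +
	    (nsecs % nsec_per_tick != 0 ? 1 : 0);

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could
end up to one tick short of the requested duration.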


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
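
The idea behind 1.76 can be illustrated with C11 atomics in userland. This
is a sketch only: flags_setbits(), flags_clearbits() and the EX_FLAG_*
values are hypothetical stand-ins, not the kernel's
atomic_{set,clear}bits_int or real P_* flags.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define EX_FLAG_A	0x01u
#define EX_FLAG_B	0x02u

/* Atomically set bits, so a concurrent update of other bits is not lost. */
void
flags_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits. */
void
flags_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain "flags |= bits" is a read-modify-write sequence, so an interrupt
or another CPU touching the same word in between can have its update
overwritten.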


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
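
The clipping rule described above can be sketched as follows. The values
are assumptions for illustration: NICE_WEIGHT is the "nice weight of two"
from the text, PRIO_MAX is taken as 20 (nice +20), and PPQ as 4;
clip_estcpu() is a hypothetical name.

```c
#include <assert.h>

#define NICE_WEIGHT	2	/* "nice weight of two" from the text */
#define PRIO_MAX	20	/* assumed: the largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

/* Keep p_estcpu at or below NICE_WEIGHT * PRIO_MAX - PPQ. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	unsigned int limit = NICE_WEIGHT * PRIO_MAX - PPQ;

	return estcpu < limit ? estcpu : limit;
}
```

With these assumed values the limit is 36, so CPU usage history alone
cannot push a nice +0 process onto the queue used by nice +20 processes.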

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.173 24-Dec-2020 cheloha

tsleep(9): add global "nowake" channel for threads avoiding wakeup(9)

It would be convenient if there were a channel a thread could sleep on
to indicate they do not want any wakeup(9) broadcasts. The easiest way
to do this is to add an "int nowake" to kern_synch.c and extern it in
sys/systm.h. You use it like this:

#include <sys/systm.h>

tsleep_nsec(&nowake, ...);

There is now no need to handroll a local dead channel, e.g.

int chan;

tsleep_nsec(&chan, ...);

which expands the stack. Local dead channels will be replaced with
&nowake in later patches.

One possible problem with this "one global channel" approach is sleep
queue congestion. If you have lots of threads sleeping on &nowake you
might slow down a wakeup(9) on a different channel that hashes into
the same queue. Unsure how much of a problem this actually is, if at all.

NetBSD and FreeBSD have a "pause" interface in the kernel that chooses
a suitable channel automatically. To keep things simple and avoid
adding a new interface we will start with this global channel.

Discussed with mpi@, claudio@, kettenis@, and deraadt@.

Basically designed by kettenis@, who vetoed my other proposals.

Bugs caught by deraadt@, tb@, and patrick@.


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
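
A minimal sketch of the clamp, using the INFSLP and MAXTSLP values stated
above. The local macro definitions and the helper name
clamp_sleep_nsecs() are illustrative assumptions, not the committed code.

```c
#include <assert.h>
#include <stdint.h>

/* Values as given in the commit message; defined locally for this sketch. */
#define INFSLP	UINT64_MAX		/* "sleep without a timeout" */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/* Hypothetical helper: make sure a finite timeout never equals INFSLP. */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return nsecs > MAXTSLP ? MAXTSLP : nsecs;
}
```

A caller that computed a very large but finite duration would pass the
clamped value, so the sleep still has a timeout attached.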


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
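
The corrected comparison can be sketched in portable C without the
timespeccmp(3) macro. timeout_elapsed() is a hypothetical helper written
for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: an absolute timeout has elapsed once the clock
 * has reached it, i.e. when deadline <= now.
 */
bool
timeout_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	if (deadline->tv_sec != now->tv_sec)
		return deadline->tv_sec < now->tv_sec;
	return deadline->tv_nsec <= now->tv_nsec;
}
```

With a strict "<" in the last comparison, a deadline of 1.000000000 would
not be treated as elapsed until the clock read 1.000000001.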


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4), a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc(), a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows us to enforce that sleeping priorities are now always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack: checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
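
A minimal sketch of the "start with the right type" point, assuming a
timespec-like pair of inputs. timespec_to_ticks() and its parameters are
hypothetical names for illustration, not the code touched by this commit.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds plus nanoseconds into a tick
 * count for a clock running at hz ticks per second.
 */
int
timespec_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t nsec_per_tick, to_ticks;

	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000000000);

	nsec_per_tick = 1000000000ULL / (uint64_t)hz;

	/*
	 * Cast before multiplying: the arithmetic is then carried out in
	 * uint64_t, where wraparound is well defined, instead of in a
	 * signed type where overflow is undefined behaviour. A wrapped
	 * value can still be wrong, so callers should also bound sec.
	 */
	to_ticks = (uint64_t)hz * (uint64_t)sec +
	    (uint64_t)nsec / nsec_per_tick;

	/* Saturate instead of truncating into a negative int. */
	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

Assigning the result of a signed multiplication to an unsigned variable
would not help, because the overflow has already happened by then.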


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
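
The "atomic inc/dec" half of such a wrapper can be sketched in userland
with C11 atomics. All names below (refcnt_sketch and its functions) are
hypothetical and are not the kernel's refcnt API; the sleeping finalise
step is left out because it depends on the kernel sleep machinery.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct refcnt_sketch {
	atomic_uint	refs;
};

/* The creator holds the first reference. */
void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

/* Take one more reference; the old value must be live and not saturated. */
void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0 && old != UINT_MAX);
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow check */
	return old == 1;
}
```

The caller that sees true from the release function is the one that may
free the object; a finalise step would instead sleep until that happens.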


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any directly used symbols. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after asking guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
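
The rounding rule can be written out as a short sketch. The tick length
below assumes hz = 100 and nsecs_to_ticks() is a hypothetical helper; it
illustrates "round up, then add one" rather than reproducing this diff.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed tick length: 10 ms, i.e. hz = 100. */
#define NSEC_PER_TICK	10000000ULL

/*
 * Hypothetical helper: number of ticks to wait so that at least nsecs
 * nanoseconds pass.
 */
int
nsecs_to_ticks(uint64_t nsecs)
{
	uint64_t ticks;

	/* Round up so that truncating division never shortens the sleep. */
	ticks = nsecs / NSEC_PER_TICK;
	if (nsecs % NSEC_PER_TICK != 0)
		ticks++;

	/* One more tick for the tick we are already in the middle of. */
	ticks++;

	if (ticks > INT_MAX)
		ticks = INT_MAX;
	assert(ticks >= 1);
	return (int)ticks;
}
```

Without the extra tick, a one-tick sleep requested late in the current
tick could return after almost no time has passed.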


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
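
The clipping rule from the "nice bug" paragraph can be sketched as below. This is an illustration only, not the committed code: the `EX_` names and `estcpu_clip()` are hypothetical, and the value 4 for PPQ is an assumption (the message itself only gives the nice weight of two and the bound NICE_WEIGHT * PRIO_MAX - PPQ).

```c
#include <assert.h>

/* Illustrative constants; only the weight of two is stated in the message. */
#define EX_NICE_WEIGHT	2	/* "nice weight of two" */
#define EX_PRIO_MAX	20	/* largest nice value */
#define EX_PPQ		4	/* assumed priorities per run queue */

/* Upper bound the message requires for p_estcpu. */
#define EX_ESTCPU_LIMIT	(EX_NICE_WEIGHT * EX_PRIO_MAX - EX_PPQ)

/*
 * Hypothetical helper: clip an estimated-CPU value so that a
 * compute-bound process cannot penalize itself onto the same queue
 * as a nice +20 process.
 */
unsigned int
estcpu_clip(unsigned int estcpu)
{
	assert(EX_ESTCPU_LIMIT > 0);
	return (estcpu > EX_ESTCPU_LIMIT) ? EX_ESTCPU_LIMIT : estcpu;
}
```

Unlike clipping an unsigned char to UCHAR_MAX, this bound is below the type's maximum, so the clip actually changes large values.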


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.172 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
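
A minimal sketch of the clamp described above, assuming only the two values given in the message. `clamp_sleep_nsecs()` is a hypothetical name used for illustration; it is not the function in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Values as stated in the commit message. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper: limit a requested duration so that it can
 * never be mistaken for the "no timeout" sentinel.
 */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	uint64_t clamped = (nsecs > MAXTSLP) ? MAXTSLP : nsecs;

	assert(clamped != INFSLP);
	return clamped;
}
```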


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
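
The difference between the two comparisons can be illustrated with a small standalone sketch. `deadline_elapsed()` is a hypothetical helper written out by hand here; it is meant to behave like `timespeccmp(tsp, &now, <=)` for normalized values, not to reproduce the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: true when the absolute deadline *tsp has been
 * reached at time *now, i.e. tsp <= now.
 */
bool
deadline_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	/* Both values are assumed to be normalized. */
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	assert(now->tv_nsec >= 0 && now->tv_nsec < 1000000000L);

	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	/* "<=" rather than "<": the deadline elapses at T, not after T. */
	return tsp->tv_nsec <= now->tv_nsec;
}
```

With a strict `<` in the last line, a deadline of 1.000000000 would not count as elapsed until the clock read 1.000000001, which is the bug described above.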


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
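
The return convention in the paragraph above amounts to a two-way choice, sketched here. `interrupted_sleep_error()` and its boolean parameter are hypothetical; how the code in the tree learns whether the signal is restartable is not shown in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the error for a sleep that was
 * interrupted by a signal, following the rule in the message.
 */
int
interrupted_sleep_error(bool sa_restart)
{
	int error = sa_restart ? ECANCELED : EINTR;

	assert(error != 0);
	return error;
}
```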

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
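
A sketch of the idea, under stated assumptions: `secs_to_ticks()` is a hypothetical helper, and the saturation step goes beyond what the message describes. The point it illustrates is the one made above: the operands are uint64_t before the multiplication happens, so there is no signed overflow for the compiler to exploit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds to ticks at hz ticks per
 * second. Working in uint64_t from the start keeps the arithmetic
 * well defined; the explicit check saturates instead of wrapping.
 */
uint64_t
secs_to_ticks(uint64_t secs, uint64_t hz)
{
	assert(hz > 0);
	if (secs > UINT64_MAX / hz)
		return UINT64_MAX;	/* would not fit: saturate */
	return secs * hz;
}
```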


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
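
The conversion rule above can be sketched as follows. `nsecs_to_ticks()` and its `nsec_per_tick` parameter are hypothetical names chosen for illustration (for example, 10000000 ns per tick at 100 ticks per second); the code in the tree may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration in nanoseconds to clock
 * ticks so that the sleep lasts at least the requested time.
 */
uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	assert(nsec_per_tick > 0);

	/* Round up: a partial tick still needs a whole tick. */
	ticks = nsecs / nsec_per_tick;
	if (nsecs % nsec_per_tick != 0)
		ticks++;

	/* Add one for the tick we are already in the middle of. */
	if (ticks < UINT64_MAX)
		ticks++;
	return ticks;
}
```

Without the extra tick, a sleep requested just before a tick boundary could end almost immediately.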

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.
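
The idea can be sketched in userland with C11 atomics. This is only an
illustration: flag_setbits() and flag_clearbits() are hypothetical stand-ins
for the kernel's atomic_{set,clear}bits_int, and <stdatomic.h> is assumed in
place of the kernel's atomic.h.

```c
#include <stdatomic.h>

/*
 * Hypothetical userland stand-ins for atomic_setbits_int() and
 * atomic_clearbits_int(); each is a single atomic read-modify-write, so
 * a concurrent update of other bits in the same word is not lost.
 */
void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain "flags |= bits" is a separate load and store, which is what makes it
unsafe when an interrupt can modify the same word in between.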

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes
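
A toy model of the "at most n" behaviour, assuming a flat array instead of the
kernel's sleep queues. struct sleeper and model_wakeup_n() are hypothetical
names made up for this sketch, not the kernel interface.

```c
#include <stddef.h>

struct sleeper {
	const void *chan;	/* wait channel, NULL when not sleeping */
};

/*
 * Hypothetical model: wake at most n sleepers waiting on chan and
 * return how many were actually woken.
 */
int
model_wakeup_n(struct sleeper *q, size_t len, const void *chan, int n)
{
	int woken = 0;
	size_t i;

	if (chan == NULL)
		return 0;	/* NULL marks "not sleeping" in this model */

	for (i = 0; i < len && woken < n; i++) {
		if (q[i].chan == chan) {
			q[i].chan = NULL;	/* sleeper leaves the queue */
			woken++;
		}
	}
	return woken;
}
```

In this model wakeup_one corresponds to n = 1, and a plain wakeup to an n
large enough to cover every sleeper.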


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
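
The clipping bound from the "nice bug" paragraph can be illustrated with a
small sketch. The EX_ constants are assumptions chosen for the example (the log
itself only states a nice weight of two), and pri_offset() assumes the
priority offset is the unscaled p_estcpu plus the weighted nice value, which
may not match the committed code exactly.

```c
/* Illustrative constants; only the nice weight of two is from the log. */
enum {
	EX_NICE_WEIGHT = 2,	/* priority points per nice level */
	EX_PRIO_MAX = 20,	/* largest nice value (assumed) */
	EX_PPQ = 4		/* priorities per run queue (assumed) */
};

/* Upper bound on p_estcpu named in the log: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define EX_ESTCPU_LIMIT	((unsigned int)(EX_NICE_WEIGHT * EX_PRIO_MAX - EX_PPQ))

/* Clip the CPU-usage estimate to the bound. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu > EX_ESTCPU_LIMIT) ? EX_ESTCPU_LIMIT : estcpu;
}

/* Assumed priority offset: clipped CPU history plus weighted nice value. */
unsigned int
pri_offset(unsigned int estcpu, unsigned int nice)
{
	return clip_estcpu(estcpu) + (unsigned int)EX_NICE_WEIGHT * nice;
}
```

With these values a compute-bound nice +0 process tops out one queue below an
idle nice +20 process, which is the separation the commit message asks for.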


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.171 23-Oct-2020 cheloha

sleep_setup_timeout(): always KASSERT that P_TIMEOUT is unset

Even if we aren't setting a timeout, P_TIMEOUT should not be set at
this point in the sleep.

ok visa@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
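
A minimal sketch of the clamp, using the two values given above.
clamp_sleep_nsecs() is a hypothetical helper written for this example and is
not the code in __thrsleep(2).

```c
#include <stdint.h>

/* Values as described in the commit message above. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: a caller that wants a timeout must never pass
 * INFSLP, so limit the requested duration to MAXTSLP.
 */
uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```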


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
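
The boundary case can be checked in userland with a small sketch.
TIMESPEC_CMP is a hypothetical stand-in written in the style of
timespeccmp(3), and timeout_elapsed() is an illustrative helper, not the
kernel code.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for timespeccmp(3): seconds first, then nanoseconds. */
#define TIMESPEC_CMP(a, b, op)				\
	(((a)->tv_sec == (b)->tv_sec) ?			\
	    ((a)->tv_nsec op (b)->tv_nsec) :		\
	    ((a)->tv_sec op (b)->tv_sec))

/* An absolute timeout has elapsed once the clock has reached it. */
bool
timeout_elapsed(const struct timespec *timeout, const struct timespec *now)
{
	return TIMESPEC_CMP(timeout, now, <=);
}
```

With "<" in place of "<=" the call with now equal to the timeout would report
that the timeout is still pending, which is the off-by-one-nanosecond case
described above.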


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.
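
One way to picture the conversion is the sketch below. It is not the
committed algorithm: nsecs_to_ticks_atleast() is a hypothetical helper that
rounds up to whole ticks and adds one tick for the tick already in progress,
with tick_nsec (nanoseconds per tick) passed in as an assumed parameter.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the committed algorithm.  tick_nsec must be
 * nonzero.  The result is always at least one tick.
 */
int
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	/* Ceiling division that cannot overflow for large nsecs. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Clamp so that the extra tick still fits in an int. */
	if (ticks > (uint64_t)INT_MAX - 1)
		ticks = (uint64_t)INT_MAX - 1;

	/* One more tick: the current tick is already partly used up. */
	return (int)(ticks + 1);
}
```

The extra tick is what turns "about this long" into "at least this long": a
sleep that starts late in the current tick would otherwise wake up early.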

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
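
A minimal sketch of the point about starting with the right type, written as userland C. The helper name and parameters are hypothetical and this is not the kernel's code; it only shows that the widening cast has to be applied to the operands before the multiplication, and that the result is clamped before being narrowed back to int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a non-negative (sec, nsec) timeout to
 * ticks at `hz` ticks per second, clamped to INT_MAX.
 */
static int
timespec_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t nsec_per_tick, to_ticks;

	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000000);

	/* With hz >= 1 the tick count is at least sec, so clamp early. */
	if ((uint64_t)sec > (uint64_t)INT_MAX)
		return INT_MAX;

	nsec_per_tick = 1000000000ULL / (uint64_t)hz;

	/* Cast the operands first so the arithmetic is done in uint64_t. */
	to_ticks = (uint64_t)hz * (uint64_t)sec +
	    (uint64_t)nsec / nsec_per_tick;

	if (to_ticks > (uint64_t)INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

Writing `hz * sec` in a signed type and assigning the result to an unsigned variable afterwards would not help, because the overflow would already have happened in the signed multiplication.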


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
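
The "atomic inc/dec" part of such a wrapper can be sketched with C11 atomics. The names below (struct rc, rc_init, rc_take, rc_rele) are hypothetical userland stand-ins, not the kernel API, and the sleeping finalise step is left out because it needs the kernel sleep machinery. The assertions mirror the overflow and underflow checks that later revisions (1.127, 1.128) describe.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical userland reference counter. */
struct rc {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
rc_init(struct rc *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
rc_take(struct rc *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* object was already released */
	assert(old != UINT_MAX);	/* counter would overflow */
	(void)old;
}

/* Drop a reference; returns 1 if this was the last one. */
static int
rc_rele(struct rc *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* underflow */
	return old == 1;
}
```

The caller that sees rc_rele() return 1 is the one responsible for freeing the object.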


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
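
Why the check-then-delete pattern is racy can be shown with a small userland model. Everything here is hypothetical (struct model_timeout and its functions are not the kernel's timeout API); the sketch only assumes that deletion reports whether it actually removed something, as the message says.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a timeout with a single "scheduled" flag. */
struct model_timeout {
	atomic_bool pending;
};

static void
model_timeout_add(struct model_timeout *to)
{
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout. Returns 1 if this call removed a pending
 * timeout and 0 if it was not pending. The test and the clear are a
 * single atomic step, so nothing can fire in between them.
 */
static int
model_timeout_del(struct model_timeout *to)
{
	return atomic_exchange(&to->pending, false) ? 1 : 0;
}
```

With a separate "is it pending?" query followed by a delete, the timeout could fire between the two calls, and any decision taken on the query result would be stale. Branching on the return value of the delete avoids that window.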


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
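
The rounding described above can be sketched as follows. The helper name and its nanosecond interface are assumptions made for illustration, not the code from this revision.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to a
 * number of clock ticks at `hz` ticks per second.
 */
static uint64_t
nsec_to_ticks(uint64_t nsec, unsigned int hz)
{
	uint64_t nsec_per_tick, ticks;

	assert(hz > 0 && hz <= 1000000000U);
	nsec_per_tick = 1000000000ULL / hz;

	/* Round up so a partial tick is not truncated away. */
	ticks = nsec / nsec_per_tick + (nsec % nsec_per_tick != 0);

	/* Add one for the tick that is already partly elapsed. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested just before a tick boundary could end almost one full tick early, because the first tick counted would be the one already in progress.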


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
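
To illustrate the sleep queue point in the list above: with a TAILQ, a wakeup is one pass over the queue and each unlink is constant time. The sketch below is a userland model using <sys/queue.h>; struct sleeper and wake_n() are hypothetical names, not the kernel's structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a sleeping thread. */
struct sleeper {
	const void *wchan;		/* wait channel, NULL once woken */
	TAILQ_ENTRY(sleeper) link;	/* sleep queue linkage */
};
TAILQ_HEAD(sleepq, sleeper);

/* Wake up to n sleepers waiting on chan; returns how many were woken. */
static int
wake_n(struct sleepq *q, const void *chan, int n)
{
	struct sleeper *s, *next;
	int woken = 0;

	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking s */
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* O(1) unlink */
		s->wchan = NULL;
		woken++;
	}
	return woken;
}
```

Passing n = 1 gives wakeup_one-style behaviour, and a very large n wakes every sleeper on the channel.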


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
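
The reason plain `|=` and `&= ~` are unsafe on a flag word touched from interrupt context is that each is a separate load, modify, and store. A userland sketch with C11 atomics, using hypothetical helper names and flag values rather than the kernel's atomic.h operations:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define XF_ALPHA	0x0001u
#define XF_BETA		0x0004u

/* Set bits in *flags as one indivisible read-modify-write. */
static void
flags_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits in *flags as one indivisible read-modify-write. */
static void
flags_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

With non-atomic updates, an interrupt that sets a bit between the load and the store of an interrupted update would have its change silently overwritten.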


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
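
The clipping rule from the "nice bug" paragraph can be worked through in a short sketch. The constant values below are assumptions taken from traditional 4.4BSD conventions (PUSER 50, MAXPRI 127, PPQ 4) and the function names are hypothetical; only the limit NICE_WEIGHT * PRIO_MAX - PPQ, the nice weight of two, and the use of unscaled p_estcpu come from the message itself.

```c
#include <assert.h>

/* Assumed constants, modelled on traditional 4.4BSD values. */
#define PUSER		50	/* base user priority */
#define MAXPRI		127	/* worst (numerically highest) priority */
#define PPQ		4	/* priorities per run queue */
#define PRIO_MAX	20	/* largest nice value */
#define NICE_WEIGHT	2	/* priority points per nice level */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the CPU usage estimate as described in the message. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}

/* User priority for a nice value in 0..PRIO_MAX; larger is worse. */
static int
user_priority(unsigned int estcpu, int nice)
{
	int pri;

	assert(nice >= 0 && nice <= PRIO_MAX);
	pri = PUSER + (int)clip_estcpu(estcpu) + NICE_WEIGHT * nice;
	return pri < MAXPRI ? pri : MAXPRI;
}
```

Under these assumptions a fully compute-bound nice 0 process tops out at priority 86 (queue 21), while an idle nice +20 process sits at 90 (queue 22), so the un-niced process can never penalize itself onto the niced process's queue.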


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.170 06-Apr-2020 claudio

Fix single thread behaviour in sleep_setup_signal(). If a thread needs to
suspend (SINGLE_SUSPEND or SINGLE_PTRACE) it needs to do this in
sleep_setup_signal(). This way the case where single_thread_clear() is
called before the sleep gets its wakeup call can be correctly handled and
the thread is put back to sleep in sleep_finish(). If the wakeup happens
before unsuspend then p_wchan is 0 and the thread will not go to sleep again.
In case of an unwind an error is returned causing the thread to return
immediately with that error.
With and OK mpi@ kettenis@


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
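
The clamp described above can be sketched in userland as follows. This is only an illustration: the two constants take the values stated in the message, while the helper name clamp_sleep_nsec() is hypothetical and does not come from the kernel source.

```c
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper (not the kernel's code): keep a computed duration
 * from colliding with the INFSLP sentinel, so a timeout is always armed.
 */
static uint64_t
clamp_sleep_nsec(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

Only the single value UINT64_MAX is changed by the clamp; every other duration passes through untouched.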


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
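
The boundary case can be checked with a small userland sketch. It assumes a local stand-in for timespeccmp(3), named TS_CMP here to avoid clashing with any system definition, and two hypothetical helpers that are not part of the kernel source.

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3); the name TS_CMP is an assumption. */
#define TS_CMP(a, b, cmp)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec cmp (b)->tv_nsec) :			\
	    ((a)->tv_sec cmp (b)->tv_sec))

/* Hypothetical helper: has the absolute timeout *tsp been reached at *now? */
static bool
timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	return TS_CMP(tsp, now, <=);
}

/* The strict comparison described as wrong above, kept for contrast. */
static bool
timeout_elapsed_strict(const struct timespec *tsp, const struct timespec *now)
{
	return TS_CMP(tsp, now, <);
}
```

When the clock reads exactly T, timeout_elapsed() reports the timeout as elapsed while timeout_elapsed_strict() does not; the two agree for every other clock value.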


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
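
A minimal sketch of the point about types, assuming a hypothetical helper timeout_to_ticks() that is not the kernel's code. Signed overflow is undefined behaviour in C, whereas unsigned arithmetic wraps modulo 2^64, so the operands are converted to uint64_t before the multiplication rather than after it.

```c
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL

/*
 * Hypothetical helper: convert seconds plus nanoseconds to clock ticks.
 * Assumes sec >= 0, 0 <= nsec < 1000000000 and 1 <= hz <= 1000000000.
 */
static uint64_t
timeout_to_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t tick_nsec = NSEC_PER_SEC / (uint64_t)hz;

	/*
	 * Cast each operand first so the multiply and add are carried out
	 * in unsigned 64-bit arithmetic. A huge input then wraps in a
	 * defined way instead of invoking undefined behaviour.
	 */
	return (uint64_t)sec * (uint64_t)hz + (uint64_t)nsec / tick_nsec;
}
```

A wrapped result is still not a meaningful tick count, so a caller would need its own range check; the cast only guarantees that the arithmetic itself is well defined.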


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
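
The conversion rule can be sketched as follows. The helper name nsec_to_ticks() and its parameters are hypothetical, chosen for illustration; the sketch also assumes tick_nsec is at least 2 so that the final additions cannot overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: convert a relative timeout
 * in nanoseconds to a tick count that never undershoots the request.
 * tick_nsec is the length of one clock tick in nanoseconds.
 */
static uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks = nsecs / tick_nsec;

	/* Round up: a partial tick still has to be waited out in full. */
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Add one for the tick that is already partly elapsed. */
	return ticks + 1;
}
```

With a 10 ms tick (hz = 100), a request of exactly 10 ms yields 2 ticks: one for the interval itself and one for the tick already in progress when the sleep starts.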


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.169 31-Mar-2020 claudio

Move sleep_finish_all() down to where sleep_finish() and all other
sleep_setup/finish related functions are.
OK kettenis@


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
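
A minimal sketch of the clamp described above, assuming the constant values
given in this message. clamp_sleep_nsecs is a hypothetical helper name used
only for illustration; it is not the code that was committed.

```c
#include <stdint.h>

/* Values as stated in the commit message above. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep */

/*
 * Hypothetical helper: keep a caller-supplied finite duration from
 * aliasing the INFSLP sentinel, so that a timeout is always set.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

Only the single value UINT64_MAX is changed by the clamp; every shorter
duration passes through untouched.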


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
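
The difference between the two comparisons can be sketched in userland as
follows. ts_cmp_le and abs_timeout_elapsed are hypothetical helpers written
for this illustration; ts_cmp_le stands in for timespeccmp(a, b, <=) and is
not the macro itself.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: true when *a <= *b. */
static bool
ts_cmp_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute timeout has elapsed once the clock has reached it. */
static bool
abs_timeout_elapsed(const struct timespec *timeout,
    const struct timespec *now)
{
	return ts_cmp_le(timeout, now);
}
```

With a strict "<" the case timeout == now would be reported as not yet
elapsed, which is the off-by-one-nanosecond behaviour described above.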


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
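
The return-value convention described above can be sketched as below.
interrupted_sleep_error is a hypothetical helper name, and the boolean
argument is an assumed stand-in for however the caller learns that the
interrupting signal's handler was installed with SA_RESTART.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the error reported to userland when a
 * sleep was interrupted by a signal.
 */
static int
interrupted_sleep_error(bool sa_restart)
{
	/* ECANCELED: restartable signal; EINTR: caller must not retry blindly. */
	return sa_restart ? ECANCELED : EINTR;
}
```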


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
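
A sketch of the "start with the right type" idea, under stated assumptions:
timespec_to_ticks is a hypothetical helper, and its parameters (seconds,
nanoseconds below one second, and a positive tick rate hz) are chosen for
illustration rather than taken from the committed code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scale seconds and nanoseconds to clock ticks.
 * Each product is widened to uint64_t *before* multiplying, so the
 * arithmetic is unsigned from the start and never relies on signed
 * overflow, which is undefined behaviour in C.
 */
static uint64_t
timespec_to_ticks(int64_t sec, long nsec, int hz)
{
	return (uint64_t)sec * (uint64_t)hz +
	    (uint64_t)nsec * (uint64_t)hz / 1000000000ULL;
}
```

Casting only the final result would be too late, since the multiplication
would already have been carried out in the narrower signed type.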


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
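
The "atomic inc/dec" half of such a wrapper can be modelled in userland
roughly as follows. struct ref_model and its functions are hypothetical
names built on C11 <stdatomic.h>, and the initial count of one is an
assumption; the sleeping finalise step is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland model; not the kernel's struct refcnt. */
struct ref_model {
	atomic_uint refs;
};

static void
ref_model_init(struct ref_model *r)
{
	atomic_init(&r->refs, 1);	/* assumed: creator holds one reference */
}

static void
ref_model_take(struct ref_model *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Returns true when the caller dropped the last reference. */
static bool
ref_model_rele(struct ref_model *r)
{
	/* atomic_fetch_sub returns the value held before the decrement. */
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```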


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
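
The rounding described above can be sketched as follows. This is a minimal
illustration, not the code from this commit: the helper name usec_to_ticks()
is hypothetical, and it assumes a timeout in microseconds and a tick rate hz
that is non-zero and divides 1000000 evenly.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Convert a relative timeout in microseconds to a tick count that is
 * guaranteed to cover at least the requested interval.
 */
static uint64_t
usec_to_ticks(uint64_t usec, uint64_t hz)
{
	uint64_t tick_usec = 1000000 / hz;	/* length of one tick */
	uint64_t ticks;

	/* Round up so a partial tick still counts as a whole one. */
	ticks = (usec + tick_usec - 1) / tick_usec;

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

With hz = 100 a tick lasts 10000 microseconds, so a 10000 microsecond timeout
becomes two ticks rather than one: the current tick may be almost over when
the sleep starts.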


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.168 26-Mar-2020 claudio

Revert Rev 1.164. Setting sls_sig to 0 uncovered a bunch of issues when it
comes to setting a process into single thread mode. It is still wrong but
first the interaction with single_thread_set() must be corrected.


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
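
A minimal sketch of the clamp described above, assuming the two values given
in this message. The helper name clamp_sleep_nsecs() is hypothetical; the
actual change in sys___thrsleep() may be written differently.

```c
#include <stdint.h>

/* Values as stated in the commit message. */
#define INFSLP	UINT64_MAX		/* sentinel: do not set a timeout */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest sleep that still sets one */

/*
 * Hypothetical helper, for illustration only.
 * Keep a caller-supplied duration from colliding with the INFSLP
 * sentinel, so that a requested timeout is always armed.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

Only the single value UINT64_MAX is changed by the clamp; every other
duration passes through untouched.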


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
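
The difference between the two comparisons can be sketched outside the
kernel. This is an illustration only: ts_cmp_le() is a hypothetical stand-in
for timespeccmp(a, b, <=) and abs_timeout_elapsed() is not a kernel function.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical stand-in for timespeccmp(a, b, <=): true when the
 * time in *a is earlier than or equal to the time in *b.
 */
static bool
ts_cmp_le(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_nsec <= b->tv_nsec;
}

/* An absolute timeout has elapsed once the clock has reached it. */
static bool
abs_timeout_elapsed(const struct timespec *deadline,
    const struct timespec *now)
{
	return ts_cmp_le(deadline, now);
}
```

With a deadline of 1.000000000, the check is already true when the clock
reads exactly 1.000000000; a strict "<" would wait one more nanosecond.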


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.
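
The return-value rule in the paragraph above can be sketched as a tiny
helper. This is only an illustration: interrupted_sleep_errno() is a
hypothetical name, and the real code derives the SA_RESTART state from
the signal machinery rather than from a boolean argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: pick the errno for a sleep interrupted by a
 * signal. ECANCELED tells the caller the signal was restartable
 * (SA_RESTART set); EINTR means it was not.
 */
static int
interrupted_sleep_errno(bool sa_restart_set)
{
	return sa_restart_set ? ECANCELED : EINTR;
}
```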

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
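
A minimal sketch of the idea in the message above, assuming a
hypothetical ts_to_ticks_sketch() helper rather than the kernel's
actual code: the operands are cast to uint64_t before the arithmetic,
so the computation is done in an unsigned type from the start and an
overflow wraps in a defined way instead of being undefined behavior.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical conversion of a seconds/nanoseconds pair to ticks at
 * rate hz. Casting each operand first makes the multiply and add
 * unsigned 64-bit operations. The caller must still range-check the
 * result before using it as a timeout.
 */
static uint64_t
ts_to_ticks_sketch(int64_t sec, long nsec, int hz)
{
	uint64_t tick_nsec;

	assert(sec >= 0 && nsec >= 0 && hz > 0);
	tick_nsec = 1000000000ULL / (uint64_t)hz;
	return (uint64_t)sec * (uint64_t)hz + (uint64_t)nsec / tick_nsec;
}
```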


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.
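
The counting half of such a wrapper might look roughly like the
userland sketch below, written with C11 atomics. All names ending in
_sketch are hypothetical; the sleeping part of refcnt_finalise is
deliberately left out, and the function only reports whether a real
implementation would have to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland model of a reference count wrapper. */
struct refcnt_sketch {
	atomic_uint refs;
};

/* The owner starts out holding one reference. */
static void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

static void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop one reference; returns true if it was the last one. */
static bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow is a bug */
	return old == 1;
}

/*
 * Drop the owner's reference. Returns true if other references are
 * still outstanding, i.e. where the kernel version would sleep.
 */
static bool
refcnt_sketch_finalise_would_sleep(struct refcnt_sketch *r)
{
	return !refcnt_sketch_rele(r);
}
```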

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.
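
To illustrate why the single call is safer, here is a userland model
(not the timeout(9) implementation): fake_timeout and
fake_timeout_del() are made-up names. The point is that "was it
pending?" and "cancel it" happen in one atomic step, so nothing can
change between a separate check and the delete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a timeout with only a pending flag. */
struct fake_timeout {
	atomic_bool pending;
};

/*
 * Cancel the timeout and report whether it was pending, in a single
 * atomic exchange. A separate "if (pending) del()" would leave a
 * window between the check and the delete.
 */
static bool
fake_timeout_del(struct fake_timeout *to)
{
	return atomic_exchange(&to->pending, false);
}
```

A caller then branches on the return value of the delete itself rather
than on an earlier pending check.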

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
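
A small sketch of that rule, using a hypothetical helper name and a
tick length passed in as a parameter (the real code works from the
kernel's own tick variables): round the duration up to whole ticks,
then add one more for the partially elapsed current tick.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical conversion: nsec is the requested sleep time and
 * tick_nsec the length of one tick, both in nanoseconds.
 */
static uint64_t
nsec_to_ticks_roundup(uint64_t nsec, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec != 0);
	/* Round up so the sleep is never shorter than requested. */
	ticks = nsec / tick_nsec + (nsec % tick_nsec != 0);
	/* One extra tick for the one we are already in the middle of. */
	return ticks + 1;
}
```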

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Compared to the c2k5 patch, this one has been tested better,
by more people and for a longer time.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
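
A rough sketch of the clipping rule described under "nice bug", assuming
the post-change priority offset is simply the clipped p_estcpu plus
NICE_WEIGHT times the nice value. PPQ = 4 and both helper names are
assumptions made for illustration; this is not the NetBSD or OpenBSD code.

```c
#include <assert.h>

#define NICE_WEIGHT	2u	/* "nice weight of two", per the log */
#define PRIO_MAX	20u	/* largest nice value (nice +20) */
#define PPQ		4u	/* assumed: priorities sharing one run queue */
#define ESTCPU_MAX	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip the CPU-usage estimate to the limit above. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return (estcpu > ESTCPU_MAX) ? ESTCPU_MAX : estcpu;
}

/* Hypothetical helper: user-priority offset for a non-negative nice. */
static unsigned int
usrpri_offset(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= PRIO_MAX);
	return estcpu_clip(estcpu) + NICE_WEIGHT * nice;
}
```

Under these assumptions, an un-niced process with a saturated estimate
still lands one queue below an idle nice +20 process, which is the
property the clipping is meant to keep.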


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.167 23-Mar-2020 visa

Prevent tsleep(9) with PCATCH from returning immediately without error
when called during execve(2). This was caused by initializing sls_sig
with value 0 in r1.164 of kern_synch.c. Previously, tsleep(9) returned
immediately with EINTR in similar circumstances.

The immediate return without error can cause a system hang. For example,
vwaitforio() could end up spinning if called during execve(2) because
the thread did not enter sleep and other threads were not able to finish
the I/O.

tsleep
vwaitforio
nfs_flush
nfs_close
VOP_CLOSE
vn_closefile
fdrop
closef
fdcloseexec
sys_execve

Fix the issue by checking (p->p_flag & P_SUSPSINGLE) instead of
(p->p_p->ps_single != NULL) in sleep_setup_signal(). The former is more
selective than the latter and allows the thread that invokes execve(2)
to enter sleep normally.

Bug report, change bisecting and testing help by Pavel Korovin

OK claudio@ mpi@


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
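
A minimal userland sketch of the clamp described above. The two macro
values are the ones stated in this log entry; the helper name is
hypothetical and this is not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

#define INFSLP	UINT64_MAX		/* "no timeout" sentinel, per the log */
#define MAXTSLP	(UINT64_MAX - 1)	/* longest finite sleep, per the log */

/*
 * Hypothetical helper: keep a computed duration from aliasing INFSLP,
 * so that a caller who asked for a (very long) timeout still gets one.
 */
static uint64_t
clamp_sleep_nsecs(uint64_t nsecs)
{
	return (nsecs > MAXTSLP) ? MAXTSLP : nsecs;
}
```

Only the single value equal to INFSLP is changed; every shorter
duration passes through untouched.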


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
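
The difference between the two comparisons can be illustrated in
userland. The helper below is a hypothetical stand-in that spells out
what timespeccmp(tsp, &now, <=) evaluates; it is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper: true once the absolute timeout *tsp has been
 * reached by the clock reading *now, i.e. *tsp <= *now.
 */
static bool
timeout_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	if (tsp->tv_sec != now->tv_sec)
		return tsp->tv_sec < now->tv_sec;
	return tsp->tv_nsec <= now->tv_nsec;
}
```

With a timeout of 1.000000000, this reports the timeout as elapsed when
the clock reads exactly 1.000000000, whereas a strict "<" would wait
for 1.000000001.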


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c.

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
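
A hedged sketch of why the arithmetic should start out unsigned. The
function, its parameters, and the saturating behaviour are assumptions
made for illustration; they are not taken from the actual diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical conversion of seconds plus nanoseconds to ticks at `hz`.
 * Every operand is uint64_t from the start, so each overflow check is
 * well-defined C.  With signed operands the multiply itself would be
 * undefined on overflow, and a later cast could not repair it.
 */
static uint64_t
timespec_to_ticks(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t whole, frac;

	assert(hz > 0 && hz <= 1000000);
	assert(nsec < 1000000000ULL);

	if (sec > UINT64_MAX / hz)
		return UINT64_MAX;		/* saturate, do not wrap */
	whole = sec * hz;
	frac = nsec * hz / 1000000000ULL;	/* cannot overflow: < 10^15 */
	if (whole > UINT64_MAX - frac)
		return UINT64_MAX;
	return whole + frac;
}
```

For example, 2.5 seconds at hz = 100 yields 250 ticks, while an absurdly
large seconds value saturates instead of wrapping to a small or negative
count.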


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
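
The rounding rule above can be sketched in userland as follows. The
helper name and the nanosecond tick length parameter are assumptions
made for illustration; this is not the code from the diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration in nanoseconds to a tick
 * count that is guaranteed to cover it.  Round any partial tick up,
 * then add one for the tick already in progress, of which an unknown
 * fraction has already been used.
 */
static uint64_t
nsecs_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 1);	/* also keeps the final +1 from wrapping */

	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round up */
	return ticks + 1;		/* account for the current tick */
}
```

With a 10 ms tick (hz = 100), a 25 ms request becomes 4 ticks: 2.5
rounds up to 3, plus one for the partially elapsed current tick.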


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
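
The clipping argument above can be sketched in a few lines. The constants are assumptions chosen to match the text (a nice weight of two, PRIO_MAX of 20, and an assumed 4 priorities per queue); the sketch_ names are illustrative only:

```c
#include <assert.h>
#include <limits.h>

/* Illustrative constants; assumed, not copied from the scheduler headers. */
#define SKETCH_NICE_WEIGHT	2
#define SKETCH_PRIO_MAX		20
#define SKETCH_PPQ		4	/* priorities per run queue (assumed) */
#define SKETCH_ESTCPU_MAX \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/*
 * The old clip: an unsigned char can never exceed UCHAR_MAX, so this
 * always returns its argument unchanged.
 */
static unsigned char
sketch_clip_old(unsigned char estcpu)
{
	return (estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu);
}

/* The intended clip: keep estcpu <= NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned char
sketch_clip_new(unsigned char estcpu)
{
	assert(SKETCH_ESTCPU_MAX <= UCHAR_MAX);
	return (estcpu > SKETCH_ESTCPU_MAX ? SKETCH_ESTCPU_MAX : estcpu);
}
```

With these assumed values the limit works out to 36, so a compute-bound process can no longer penalize itself past the nice +20 queue.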

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.166 20-Mar-2020 cheloha

__thrsleep(2): ensure timeout is set when calling tsleep_nsec(9)

tsleep_nsec(9) will not set a timeout if the nsecs parameter is
equal to INFSLP (UINT64_MAX). We need to limit the duration to
MAXTSLP (UINT64_MAX - 1) to ensure a timeout is set.
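
A minimal sketch of the clamp described above. The two constants use the values stated in the message; the function name is hypothetical and this is not the committed code:

```c
#include <assert.h>
#include <stdint.h>

/* Values as stated in the commit message above. */
#define SKETCH_INFSLP	UINT64_MAX		/* "no timeout" sentinel */
#define SKETCH_MAXTSLP	(UINT64_MAX - 1)	/* longest real timeout */

/*
 * Clamp a caller-supplied duration so that a very long sleep can
 * never be mistaken for "sleep without a timeout".
 */
static uint64_t
sketch_clamp_nsecs(uint64_t nsecs)
{
	uint64_t clamped = (nsecs < SKETCH_MAXTSLP) ? nsecs : SKETCH_MAXTSLP;

	assert(clamped != SKETCH_INFSLP);
	return (clamped);
}
```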


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
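
The boundary case can be checked with a small userland sketch. The comparison macro below is a local stand-in for timespeccmp(3), defined here so the example builds on systems that lack it; the function names are hypothetical:

```c
#include <assert.h>
#include <time.h>

/* Local stand-in for timespeccmp(3). */
#define SKETCH_TIMESPECCMP(a, b, cmp)				\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec cmp (b)->tv_nsec) :			\
	    ((a)->tv_sec cmp (b)->tv_sec))

/* Fixed check: the timeout has elapsed once the clock reaches it. */
static int
sketch_elapsed(const struct timespec *tsp, const struct timespec *now)
{
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	return (SKETCH_TIMESPECCMP(tsp, now, <=));
}

/* Pre-fix check, kept only for contrast: one nanosecond late. */
static int
sketch_elapsed_old(const struct timespec *tsp, const struct timespec *now)
{
	assert(tsp->tv_nsec >= 0 && tsp->tv_nsec < 1000000000L);
	return (SKETCH_TIMESPECCMP(tsp, now, <));
}
```

When the clock equals the timeout exactly, the fixed check reports the timeout as elapsed while the old one does not.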


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
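
One way to read "sleep at least the given number of nanoseconds" is sketched below: round the division up, then add one tick for the tick already in progress. This is an assumption about the general shape of such a conversion, not the committed algorithm; the function name and the tick_nsec parameter are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert nanoseconds to a tick count that covers at least nsecs.
 * tick_nsec is the length of one tick in nanoseconds; it must be
 * greater than 1 so the final increment cannot wrap.
 */
static uint64_t
sketch_nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 1);
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;		/* round a partial tick up */
	return (ticks + 1);		/* the current tick is partly gone */
}
```

With a 10 ms tick, a request for 1 ns yields 2 ticks: one for the rounded-up interval and one for the tick already under way.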


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
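
The error mapping described above might look roughly like this. SKETCH_ERESTART is a hypothetical stand-in for the kernel-internal restart code, and the function name is made up; only ECANCELED and EINTR come from <errno.h>:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel-internal "restart syscall" code. */
#define SKETCH_ERESTART	(-1)

/*
 * Translate the sleep result before returning to userland: a
 * restartable interruption becomes ECANCELED, anything else
 * (including EINTR and success) is passed through unchanged.
 */
static int
sketch_thrsleep_error(int error)
{
	if (error == SKETCH_ERESTART)
		error = ECANCELED;
	assert(error != SKETCH_ERESTART);
	return (error);
}
```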


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
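
A sketch of tick arithmetic that starts in uint64_t, as the message above recommends. It goes further than the commit by also saturating the result, which is an assumption of this example; the function name and parameters are hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert seconds + nanoseconds to ticks at hz ticks per second.
 * All arithmetic is done in uint64_t from the start, and the result
 * saturates at INT_MAX instead of wrapping or going negative.
 */
static int
sketch_ts_to_ticks(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	assert(nsec < 1000000000ULL);

	/* sec * hz plus the sub-second part (< hz) must fit in 64 bits. */
	if (sec > (UINT64_MAX - hz) / hz)
		return (INT_MAX);
	ticks = sec * hz + (nsec * hz) / 1000000000ULL;
	if (ticks > (uint64_t)INT_MAX)
		return (INT_MAX);
	return ((int)ticks);
}
```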


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
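
The "atomic inc/dec" half of such a wrapper can be sketched in userland with C11 atomics. The sleeping part of refcnt_finalise is deliberately left out, and the sketch_ names and layout are illustrative rather than the kernel's:

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* object was already released */
	(void)old;
}

/* Returns 1 if this call dropped the last reference, else 0. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow */
	return (old == 1);
}
```

The caller that sees a return value of 1 from the release helper is the one responsible for freeing the object.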


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
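
The race comes from checking and cancelling in two separate steps. A toy model of the one-step form, using a C11 atomic flag in place of the real timeout machinery (all names here are hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_timeout {
	atomic_bool pending;
};

static void
sketch_timeout_add(struct sketch_timeout *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout and report whether it was pending, as one
 * atomic exchange. There is no window between "check" and "delete"
 * in which the timeout could fire or be cancelled by someone else.
 */
static bool
sketch_timeout_del(struct sketch_timeout *to)
{
	assert(to != NULL);
	return (atomic_exchange(&to->pending, false));
}
```

Callers then branch on the return value of the delete call instead of on a separate pending check.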


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
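
A minimal userland sketch of the conversion described above, assuming the timeout is given in nanoseconds and the tick length is known. The helper name nsec_to_ticks() and its parameters are hypothetical and are not taken from the kernel source:

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds into a
 * tick count that is guaranteed to cover at least the requested time.
 * tick_nsec is the length of one tick in nanoseconds and must be nonzero.
 */
static uint64_t
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	/* Round up so that a partial tick still counts as a whole one. */
	uint64_t ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

With a 10 ms tick (hz = 100), a 10 ms timeout becomes 2 ticks rather than 1, because some unknown part of the current tick has already passed.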


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (almost nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
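
The clipping rule from the "nice bug" paragraph can be sketched as follows. This is an illustration only: the SKETCH_ constants are assumed values chosen to match the text (a nice weight of two, nice values up to 20) plus an assumed four priorities per queue, and clip_estcpu() is a hypothetical name, not the kernel's:

```c
/* Assumed values for illustration; the real constants live in kernel headers. */
#define SKETCH_NICE_WEIGHT	2	/* priority steps per nice level, per the text */
#define SKETCH_PRIO_MAX		20	/* largest nice value (assumption) */
#define SKETCH_PPQ		4	/* priorities per run queue (assumption) */

/* Upper bound the commit message says p_estcpu must respect. */
#define SKETCH_ESTCPU_LIMIT \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/*
 * Hypothetical helper: clip the decayed CPU estimate so that a
 * compute-bound nice +0 process cannot penalize itself onto the same
 * queue as a nice +20 process.
 */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > SKETCH_ESTCPU_LIMIT ? SKETCH_ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the limit works out to 2 * 20 - 4 = 36, which is far below UCHAR_MAX, so unlike the old clip this one actually constrains the value.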


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.165 20-Mar-2020 cheloha

__thrsleep(2): fix absolute timeout check

An absolute timeout T elapses when the clock has reached time T, i.e.
when T is less than or equal to the clock's current time.

But the current code thinks T elapses only when the clock is strictly
greater than T.

For example, if my absolute timeout is 1.00000000, the current code will
not return EWOULDBLOCK until the clock reaches 1.00000001. This is wrong:
my absolute timeout elapses a nanosecond prior to that point.

So the timespeccmp(3) here should be

timespeccmp(tsp, &now, <=)

and not

timespeccmp(tsp, &now, <)

as it is currently.
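
The difference between the two comparisons can be tried outside the kernel. The sketch below uses TS_CMP, a local stand-in written in the style of timespeccmp(3), and two hypothetical helpers; none of these names come from the kernel source:

```c
#include <stdbool.h>
#include <time.h>

/* Local stand-in for timespeccmp(3) so the sketch builds in userland. */
#define TS_CMP(a, b, op)					\
	(((a)->tv_sec == (b)->tv_sec) ?				\
	    ((a)->tv_nsec op (b)->tv_nsec) :			\
	    ((a)->tv_sec op (b)->tv_sec))

/* Fixed check: the deadline has elapsed once the clock reaches it. */
static bool
deadline_elapsed(const struct timespec *deadline, const struct timespec *now)
{
	return TS_CMP(deadline, now, <=);
}

/* Pre-fix check: only true once the clock is strictly past the deadline. */
static bool
deadline_elapsed_old(const struct timespec *deadline,
    const struct timespec *now)
{
	return TS_CMP(deadline, now, <);
}
```

For a deadline of 1.000000000 and a clock reading of exactly 1.000000000, deadline_elapsed() is true while deadline_elapsed_old() is still false, which is the one-nanosecond discrepancy described above.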


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow releasing the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disable in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows us to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
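
Illustration only, not part of the commit: a minimal userland sketch of
the "start with the right type" idea, assuming 32-bit inputs so that the
64-bit unsigned product cannot wrap. The helper name
sec_to_ticks_clamped() and the clamp to INT_MAX are assumptions made for
this example, not the kernel's code.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds to clock ticks.
 * The cast happens before the multiply, so the arithmetic is done in
 * uint64_t from the start. Two 32-bit operands cannot overflow 64 bits.
 */
int
sec_to_ticks_clamped(uint32_t sec, uint32_t hz)
{
	uint64_t ticks = (uint64_t)sec * hz;

	/* Clamp instead of letting a later narrowing go negative. */
	if (ticks > INT_MAX)
		return INT_MAX;
	return (int)ticks;
}
```

Assigning the result of a signed multiply to an unsigned variable would
be too late: the overflow has already happened in the signed arithmetic.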


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
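
Illustration only, not part of the commit: a minimal userland sketch of
the "atomic inc/dec" half of a reference count wrapper, using C11
atomics. The struct refcnt_sketch type and the refcnt_sketch_*() names
are hypothetical, and the sleeping that refcnt_finalise does while it
waits for the last reference to go away is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>

struct refcnt_sketch {
	atomic_uint refs;
};

/* The creator holds the first reference. */
void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an extra reference; the caller must already hold one. */
void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);
}

/* Drop a reference; returns 1 if this was the last one, else 0. */
int
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);
	return old == 1;
}
```

The caller that sees refcnt_sketch_rele() return 1 is the one that may
free the object.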


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
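
Illustration only, not part of the commit: a minimal userland sketch of
"round up and add one" when turning a relative timeout into ticks. The
helper name timeout_to_ticks() and its parameters are hypothetical, and
overflow handling is omitted to keep the rounding visible.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert sec + nsec into clock ticks at rate hz.
 * A fractional tick is rounded up, then one more tick is added for the
 * tick already in progress, so the sleep is never shorter than asked.
 */
uint64_t
timeout_to_ticks(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t nsec_per_tick, ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	nsec_per_tick = 1000000000ULL / hz;

	ticks = sec * hz;
	ticks += (nsec + nsec_per_tick - 1) / nsec_per_tick;	/* round up */
	return ticks + 1;	/* the tick we are in the middle of */
}
```

Without the extra tick, a request made late in the current tick could
expire almost a full tick early.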


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
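
The clipping rule from the "nice bug" paragraph can be illustrated with a
small userland sketch. It is not the kernel code: the constant values
(PUSER = 50, MAXPRI = 127, PPQ = 4) and the helper names are assumptions
chosen for illustration, and nice is taken as an offset in the range 0..20.
Only the nice weight of two and the limit NICE_WEIGHT * PRIO_MAX - PPQ come
from the message above.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define PUSER		50	/* base user priority (assumption) */
#define MAXPRI		127	/* numerically largest, i.e. worst, priority */
#define PPQ		4	/* priorities per run queue (assumption) */
#define NICE_WEIGHT	2	/* "nice weight of two" from the message */
#define PRIO_MAX	20	/* largest nice value */

/* Upper bound for p_estcpu named in the commit message. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip the CPU usage estimate to the limit. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Hypothetical helper: user priority from usage estimate and nice. */
static unsigned int
user_priority(unsigned int estcpu, int nice)
{
	unsigned int pri;

	assert(nice >= 0 && nice <= PRIO_MAX);
	pri = PUSER + clip_estcpu(estcpu) + NICE_WEIGHT * (unsigned int)nice;
	return pri > MAXPRI ? MAXPRI : pri;
}

/* Hypothetical helper: run queue index for a priority. */
static unsigned int
run_queue(unsigned int pri)
{
	return pri / PPQ;
}
```

Under these assumed constants, a compute-bound nice +0 process tops out at
priority 86 while an idle nice +20 process starts at 90, so the clipped
process never lands on the queue used by nice +20 processes.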


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.164 13-Mar-2020 claudio

Initialize sls_sig to 0 and not 1. sls_sig stores the signal number of a
possible signal that was caught during sleep setup. It does not make sense
to have a default of 1 (SIGHUP) for this.
OK visa@ mpi@


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow the lock to be released when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
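
The "sleep at least this long" conversion can be sketched in userland C as
follows. This is an illustration under stated assumptions, not the
committed algorithm: the function name, the SKETCH_INFSLP macro, and the
tick_nsec parameter (nanoseconds per tick, e.g. 10000000 at hz = 100) are
hypothetical. It rounds up, adds one tick for the tick already in
progress, and clamps to INT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for INFSLP, which the log describes as UINT64_MAX. */
#define SKETCH_INFSLP	UINT64_MAX

/*
 * Hypothetical helper: convert a nanosecond interval to a tick count
 * that is guaranteed to cover at least that interval.
 */
static int
nsec_to_ticks_sketch(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec != 0);

	/* A tick count of 0 means "no timeout" in tsleep(9). */
	if (nsecs == SKETCH_INFSLP)
		return 0;

	/* Round up without computing nsecs + tick_nsec - 1, which could wrap. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* One extra tick for the partially elapsed current tick. */
	ticks += 1;

	/* The tick-based interfaces take an int. */
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

With a 10 ms tick, a request for 25 ms yields 4 ticks: 3 to cover the
interval after rounding up, plus 1 for the current tick.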


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are now always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finish, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
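
The "atomic inc/dec" core of such a wrapper can be sketched in userland
with C11 atomics. This is a minimal illustration, not the kernel's refcnt
code: the sketch_ names are hypothetical, and the sleeping part (waiting
until all references are released) is deliberately left out. The asserts
mirror the underflow and overflow checks mentioned in revisions 1.127 and
1.128.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference counter. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* object was already released */
	assert(old != UINT_MAX);	/* counter would overflow */
}

/* Drop a reference; returns 1 if it was the last one, else 0. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* underflow: more releases than takes */
	return old == 1;
}
```

A caller would free the object only when sketch_refcnt_rele() returns 1.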


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
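
Why the check-then-delete pattern is racy can be seen in a small userland
sketch. The sketch_ names are hypothetical and an atomic exchange stands in
for whatever synchronization the real timeout(9) code uses; the point is
only that "was it pending?" and "cancel it" happen as one step whose result
is returned to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a timeout: only the pending flag is modeled. */
struct sketch_timeout {
	atomic_int pending;
};

/* Mark the timeout as scheduled. */
static void
sketch_timeout_add(struct sketch_timeout *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, 1);
}

/*
 * Cancel the timeout and report whether it was pending. Testing and
 * clearing are a single atomic step, so no other context can change
 * the state between the check and the cancellation.
 */
static int
sketch_timeout_del(struct sketch_timeout *to)
{
	assert(to != NULL);
	return atomic_exchange(&to->pending, 0);
}
```

A caller that previously wrote "if pending, then delete and drop a
reference" would instead condition the reference drop on the return value
of the delete.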


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
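
A single-pass wakeup over a TAILQ-based sleep queue can be sketched in
userland with <sys/queue.h>. The struct layout, field names, and function
name are hypothetical and no locking is shown; the sketch only illustrates
that each sleeper is visited once and unlinked in constant time.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeper record: the wait channel plus queue linkage. */
struct sleeper {
	const void *wchan;		/* address being slept on */
	int awake;			/* set once the sleeper is woken */
	TAILQ_ENTRY(sleeper) link;
};
TAILQ_HEAD(sleepq, sleeper);

/*
 * Wake up to n sleepers waiting on chan. Each entry is visited once;
 * the next pointer is saved before removal so the walk stays valid.
 */
static int
wakeup_n_sketch(struct sleepq *q, const void *chan, int n)
{
	struct sleeper *s, *next;
	int woken = 0;

	assert(q != NULL);
	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		next = TAILQ_NEXT(s, link);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* constant-time unlink */
		s->wchan = NULL;
		s->awake = 1;
		woken++;
	}
	return woken;
}
```

Passing n = 1 gives the wakeup_one behavior mentioned in revision 1.87,
where the walk stops at the first match.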


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes.
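
The "wake up to n sleepers" idea can be sketched in userland C as follows. This is only an illustration under assumptions: struct sleeper, sketch_wakeup_n() and the singly linked queue are hypothetical names and are not the kernel's sleep queue code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sleeper record; the kernel uses struct proc instead. */
struct sleeper {
	const void	*chan;	/* wait channel, NULL when not sleeping */
	struct sleeper	*next;	/* next entry on the (sketch) sleep queue */
};

/*
 * Wake at most n sleepers waiting on chan, in queue order.
 * Returns how many sleepers were actually woken.
 */
int
sketch_wakeup_n(struct sleeper *head, const void *chan, int n)
{
	struct sleeper *s;
	int woken = 0;

	assert(chan != NULL);

	/* Stop walking the queue as soon as n sleepers have been woken. */
	for (s = head; s != NULL && woken < n; s = s->next) {
		if (s->chan == chan) {
			s->chan = NULL;	/* mark as no longer sleeping */
			woken++;
		}
	}
	return woken;
}
```

In this sketch, passing n == 1 gives wakeup_one-like behaviour, and passing INT_MAX wakes every sleeper on the channel.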


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
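
The p_estcpu clipping described under "nice bug" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the values of PUSER and PPQ, the helper names, and the simplified formula (base priority plus clipped estcpu plus NICE_WEIGHT times nice) are not taken from the committed code. Only NICE_WEIGHT of two and the limit NICE_WEIGHT * PRIO_MAX - PPQ come from the message above.

```c
#include <assert.h>

#define PUSER		50	/* assumed base user priority */
#define NICE_WEIGHT	2	/* "nice weight of two" from the message */
#define PRIO_MAX	20	/* largest nice value */
#define PPQ		4	/* assumed number of priorities per run queue */
#define ESTCPU_LIMIT	((unsigned int)(NICE_WEIGHT * PRIO_MAX - PPQ))

/* Clip the decayed CPU estimate to the limit named in the message. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Simplified, assumed priority formula; larger means worse priority. */
unsigned int
sketch_usrpri(unsigned int estcpu, int nice)
{
	assert(nice >= 0 && nice <= PRIO_MAX);
	return PUSER + clip_estcpu(estcpu) + NICE_WEIGHT * (unsigned int)nice;
}

/* Run queue index for a given priority. */
unsigned int
sketch_queue(unsigned int pri)
{
	return pri / PPQ;
}
```

Under these assumptions a compute-bound process at nice 0 tops out at priority 86, one queue better than an idle nice +20 process at priority 90, which is the separation the clipping is meant to preserve.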


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.163 02-Mar-2020 bluhm

msleep() and rwsleep() allow to release the lock when going to
sleep. If sleep_setup_signal() detects that the process has been
stopped, it calls mi_switch() instead of sleeping. Then the lock
was not released and other processes got stuck. Move the mtx_leave()
and rw_exit() before sleep_setup_signal() to prevent that a stopped
process holds a short term kernel lock.
input kettenis@; OK visa@ tedu@


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
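
A minimal sketch of the kind of conversion described above, assuming a plain round-up followed by one extra tick. nsec_to_ticks() and its tick_nsec parameter are hypothetical names, and this is not the algorithm that was committed; it only shows why both the rounding and the extra tick are needed to guarantee a minimum sleep.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: convert a nanosecond
 * interval into a tick count that covers AT LEAST that interval.
 * tick_nsec is the length of one clock tick in nanoseconds.
 */
int
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Clamp so the rounding addition below cannot wrap around. */
	if (nsecs > UINT64_MAX - tick_nsec)
		nsecs = UINT64_MAX - tick_nsec;

	/* Round up: a partial tick still costs a whole tick. */
	ticks = (nsecs + tick_nsec - 1) / tick_nsec;

	/* Wait out the tick we are already in the middle of. */
	ticks += 1;

	/* The tick-based sleep interfaces take an int timeout. */
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

With a 10 ms tick (tick_nsec of 10000000), a request for exactly one tick of sleep yields two ticks, because an unknown part of the current tick has already elapsed.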


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
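
The point about starting with the right type can be shown with a short sketch. secs_to_ticks() is a hypothetical helper written for this illustration and is not one of the functions touched by the commit.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert whole seconds to clock ticks without
 * ever performing the multiplication in a signed type.
 */
int
secs_to_ticks(int secs, int hz)
{
	uint64_t ticks;

	assert(secs >= 0 && hz > 0);

	/*
	 * Cast BEFORE multiplying. Writing (uint64_t)(secs * hz) would
	 * still multiply in int, and signed overflow is undefined, so
	 * the compiler may do anything with it.
	 */
	ticks = (uint64_t)secs * (uint64_t)hz;

	/* Saturate instead of wrapping into a negative timeout. */
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

Two non-negative int values multiplied as uint64_t cannot exceed 64 bits, so the only remaining step is clamping to the range of the result type.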


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
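
The "atomic inc/dec" half of such a wrapper might look like the following userland sketch using C11 atomics. The mock_refcnt names are hypothetical, and the sleeping part of refcnt_finalise is deliberately left out because it depends on the kernel sleep machinery.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct mock_refcnt {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
void
mock_refcnt_init(struct mock_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference; the caller must already hold one. */
void
mock_refcnt_take(struct mock_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* taking a reference on a dead object */
	(void)old;
}

/* Drop a reference; returns 1 if this was the last one, else 0. */
int
mock_refcnt_rele(struct mock_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow: released more than taken */
	return old == 1;
}
```

A caller would free the object only when mock_refcnt_rele() returns 1, so exactly one releaser performs the teardown.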


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@
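
The lost-wakeup window has a well-known userland analogue with POSIX condition variables, sketched below. This is an analogy only: struct gate and its functions are hypothetical and are not related to the kernel's tsleep()/msleep() implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical one-shot gate: waiters block until it is opened. */
struct gate {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		ready;	/* the condition, protected by lock */
};

void
gate_init(struct gate *g)
{
	assert(g != NULL);
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->cv, NULL);
	g->ready = 0;
}

void
gate_wait(struct gate *g)
{
	/*
	 * The condition is tested while holding the lock, and
	 * pthread_cond_wait() releases the lock and sleeps atomically.
	 * Without the lock, gate_open() could run between the test and
	 * the sleep, and the wakeup would be lost.
	 */
	pthread_mutex_lock(&g->lock);
	while (!g->ready)
		pthread_cond_wait(&g->cv, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

void
gate_open(struct gate *g)
{
	/* Change the condition and signal under the same lock. */
	pthread_mutex_lock(&g->lock);
	g->ready = 1;
	pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->lock);
}
```

Because the condition itself is stored and checked under the lock, a waiter that arrives after the gate has been opened returns immediately instead of sleeping forever.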


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
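
Why the check-then-delete pattern is racy, and how a delete that reports its result avoids it, can be sketched with a mock. The mock_timeout type and functions below are hypothetical userland stand-ins built on C11 atomics; they are not the kernel's timeout(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel timeout. */
struct mock_timeout {
	atomic_bool pending;
};

void
mock_timeout_init(struct mock_timeout *to)
{
	assert(to != NULL);
	atomic_init(&to->pending, false);
}

/* Schedule the timeout. */
void
mock_timeout_add(struct mock_timeout *to)
{
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout. Returns 1 if this call removed a pending
 * timeout and 0 if there was nothing to remove. The test and the
 * removal happen as one atomic step, unlike
 *     if (pending(to)) del(to);
 * where the timeout can fire between the check and the delete.
 */
int
mock_timeout_del(struct mock_timeout *to)
{
	return atomic_exchange(&to->pending, false) ? 1 : 0;
}
```

Any bookkeeping that used to hang off the separate pending check can instead be made conditional on the return value of the delete.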


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless P_NORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
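
A rough sketch of the clipping rule described above, not the committed
code. The constant values (PUSER 50, MAXPRI 127, PPQ 4, NICE_WEIGHT 2,
PRIO_MAX 20) and the helper names clip_estcpu() and user_priority() are
assumptions made for illustration; nice is taken as 0..20 as in the
"nice +20" wording of the message.

```c
#include <assert.h>

#define PUSER		50	/* assumed base user priority */
#define MAXPRI		127	/* assumed worst priority */
#define PPQ		4	/* assumed priorities per run queue */
#define NICE_WEIGHT	2	/* priority points per nice level */
#define PRIO_MAX	20	/* largest nice value */

/* Upper bound named in the message: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip the decayed CPU estimate to the bound. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}

/* Hypothetical helper: unscaled estcpu plus weighted nice, capped. */
static unsigned int
user_priority(unsigned int estcpu, unsigned int nice)
{
	unsigned int pri;

	assert(nice <= PRIO_MAX);
	pri = PUSER + clip_estcpu(estcpu) + NICE_WEIGHT * nice;
	return pri > MAXPRI ? MAXPRI : pri;
}
```

Under these assumed constants a compute-bound nice +0 process tops out
at priority 86 (queue 21) while an idle nice +20 process sits at 90
(queue 22), so the former can no longer penalize itself onto the
latter's queue.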


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.162 30-Jan-2020 mpi

Split `p_priority' into `p_runpri' and `p_slppri'.

Using different fields to remember in which runqueue or sleepqueue
threads currently are will make it easier to split the SCHED_LOCK().

With this change, the (potentially boosted) sleeping priority is no
longer overwriting the thread priority. This lets us get rid of the
logic required to synchronize `p_priority' with `p_usrpri'.

Tested by many, ok visa@


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
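
The two adjustments the message mentions (integer division and the
current, partly elapsed tick) can be sketched as below. This is an
illustration under stated assumptions, not the conversion algorithm
that was committed: the function name nsecs_to_ticks_atleast() and the
tick_nsec parameter (nanoseconds per tick) are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a nanosecond interval into a tick count
 * that covers at least that interval.
 */
static int
nsecs_to_ticks_atleast(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	/* Round up so a partial tick is not lost to integer division. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Clamp before adding one so the result always fits in an int. */
	if (ticks >= INT_MAX)
		return INT_MAX;

	/* One extra tick: the current tick has already partly elapsed. */
	return (int)(ticks + 1);
}
```

With a 10 ms tick, a request for exactly 10 ms yields 2 ticks in this
sketch, and a request for 10 ms plus 1 ns yields 3.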


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
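
A minimal sketch of the idea, assuming a hypothetical helper
timeout_ticks() that is not the code touched by this commit: do the
arithmetic in uint64_t from the start and clamp before narrowing to
int, so no signed overflow can occur.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a (seconds, nanoseconds) timeout into a
 * tick count at frequency hz, clamped to INT_MAX.
 */
static int
timeout_ticks(uint64_t secs, uint64_t nsecs, unsigned int hz)
{
	uint64_t ticks;

	assert(hz > 0 && nsecs < 1000000000ULL);

	/* Clamp early so the unsigned multiplication cannot wrap. */
	if (secs > (uint64_t)INT_MAX / hz)
		return INT_MAX;

	/* Every operand is uint64_t, so no signed arithmetic is involved. */
	ticks = secs * hz + nsecs * hz / 1000000000ULL;
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

Merely assigning a signed product to an unsigned variable would not
help, because the overflow would already have happened in the signed
multiplication; that is the point the message makes about starting with
the right type.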


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
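
The "atomic inc/dec" half of such a wrapper might look roughly like the
userland sketch below, written with C11 atomics. The sk_ names are
hypothetical and this is not the kernel's refcnt code; the sleeping done
by refcnt_finalise is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct sk_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
sk_refcnt_init(struct sk_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
sk_refcnt_take(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* object was already released */
	assert(old + 1 != 0);	/* counter would have overflowed */
}

/* Drop a reference; returns 1 if it was the last one. */
static int
sk_refcnt_rele(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow: more releases than takes */
	return old == 1;
}
```

A caller would free the object only when sk_refcnt_rele() returns 1; a
finalise step would instead sleep until that last release has happened.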


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims provide no
directly used symbols. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@
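
The rule above can be illustrated in userland with POSIX threads rather than tsleep()/msleep(). This is only a sketch: the sketch_event type and its functions are hypothetical. The condition is tested and the wait is entered under the same mutex, so a post that happens first is never lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical event: "ready" is the condition, protected by mtx. */
struct sketch_event {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	int		ready;
};

static void
sketch_event_init(struct sketch_event *ev)
{
	pthread_mutex_init(&ev->mtx, NULL);
	pthread_cond_init(&ev->cv, NULL);
	ev->ready = 0;
}

/* Set the condition and wake a waiter, all under the lock. */
static void
sketch_event_post(struct sketch_event *ev)
{
	pthread_mutex_lock(&ev->mtx);
	ev->ready = 1;
	pthread_cond_signal(&ev->cv);
	pthread_mutex_unlock(&ev->mtx);
}

/*
 * Wait for the condition.  Because the check and the wait happen
 * with mtx held, there is no window between "saw ready == 0" and
 * "went to sleep" in which a post could slip by unnoticed.
 */
static void
sketch_event_wait(struct sketch_event *ev)
{
	pthread_mutex_lock(&ev->mtx);
	while (!ev->ready)
		pthread_cond_wait(&ev->cv, &ev->mtx);
	ev->ready = 0;		/* consume the event */
	pthread_mutex_unlock(&ev->mtx);
}
```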


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
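
A sketch of why a check followed by a delete is racy and a delete that reports its result is not. The sketch_timeout type is hypothetical and uses a C11 atomic flag as a stand-in; it is not the kernel's timeout code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical timeout with a single "pending" flag. */
struct sketch_timeout {
	atomic_int	pending;
};

static void
sketch_timeout_add(struct sketch_timeout *to)
{
	atomic_store(&to->pending, 1);
}

/*
 * Cancel the timeout and report whether this call cancelled it.
 * The test and the clear are one atomic step, so the answer cannot
 * go stale the way a separate "is it pending?" check can.
 */
static int
sketch_timeout_del(struct sketch_timeout *to)
{
	return (atomic_exchange(&to->pending, 0));
}

/*
 * Racy pattern (do not use):
 *	if (pending(to))	<- may fire or be cancelled right here
 *		del(to);
 * Preferred pattern:
 *	if (sketch_timeout_del(to))
 *		... we cancelled it, the handler will not run ...
 */
```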


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
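
A minimal userland sketch of the rounding described above, not the kernel's code. The helper name sketch_timespec_to_ticks and the hz parameter are assumptions made for illustration. It rounds the interval up to whole ticks, then adds one for the tick already in progress.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC	1000000000ULL

/*
 * Hypothetical helper: convert a relative timeout to a tick count.
 * hz is the assumed number of clock ticks per second.
 */
static uint64_t
sketch_timespec_to_ticks(const struct timespec *ts, unsigned int hz)
{
	uint64_t tick_nsec, nsecs, ticks;

	assert(hz > 0 && hz <= SKETCH_NSEC_PER_SEC);
	assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0);

	tick_nsec = SKETCH_NSEC_PER_SEC / hz;
	nsecs = (uint64_t)ts->tv_sec * SKETCH_NSEC_PER_SEC +
	    (uint64_t)ts->tv_nsec;

	/* Round up so that a partial tick still counts as a full one. */
	ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* The current tick is already partly used up, so add one more. */
	return (ticks + 1);
}
```

With hz = 100, a 15 ms timeout becomes 3 ticks: 2 after rounding up, plus 1 for the tick in progress.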


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
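
A userland sketch of a TAILQ-based sleep queue, assuming a <sys/queue.h> that provides the TAILQ macros. The sketch_* names are hypothetical and this is not the kernel's wakeup(). Each removal is constant time, so waking sleepers takes a single pass over the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeping thread: wchan is the address it waits on. */
struct sketch_proc {
	TAILQ_ENTRY(sketch_proc)	 link;
	const void			*wchan;
	int				 awake;
};
TAILQ_HEAD(sketch_slpque, sketch_proc);

/*
 * Wake at most n threads sleeping on ident.  TAILQ_REMOVE() is O(1),
 * so the whole walk is linear in the queue length.
 */
static int
sketch_wakeup_n(struct sketch_slpque *q, const void *ident, int n)
{
	struct sketch_proc *p, *next;
	int woken = 0;

	for (p = TAILQ_FIRST(q); p != NULL && woken < n; p = next) {
		/* Fetch the successor first: p may leave the queue. */
		next = TAILQ_NEXT(p, link);
		if (p->wchan != ident)
			continue;
		TAILQ_REMOVE(q, p, link);
		p->wchan = NULL;
		p->awake = 1;
		woken++;
	}
	return (woken);
}
```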


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
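
A userland sketch of what atomic set/clear bit operations buy over a plain read-modify-write, written with C11 atomics. The sketch_* names are hypothetical stand-ins, not the kernel's atomic_{set,clear}bits_int.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * A plain "flag |= bits" is a load, an OR and a store; an interrupt
 * or another CPU updating the same word in between loses one of the
 * two updates.  The atomic versions below do each update as a single
 * indivisible step.
 */
static void
sketch_setbits(atomic_uint *flag, unsigned int bits)
{
	atomic_fetch_or(flag, bits);
}

static void
sketch_clearbits(atomic_uint *flag, unsigned int bits)
{
	atomic_fetch_and(flag, ~bits);
}
```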


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people over a longer time have been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there are sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
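
The clipping rule from the "nice bug" paragraph can be sketched as below. This is an illustration under stated assumptions, not the committed code: the nice weight of 2 comes from the message, while PRIO_MAX = 20, PPQ = 4 (priorities per queue) and the unscaled offset formula estcpu + weight * nice are assumptions, and all sketch_* names are hypothetical.

```c
#include <assert.h>

#define SKETCH_NICE_WEIGHT	2	/* from the commit message */
#define SKETCH_PRIO_MAX		20	/* assumed: largest nice value */
#define SKETCH_PPQ		4	/* assumed: priorities per queue */

/* Upper bound on p_estcpu: one queue short of what nice +20 adds. */
#define SKETCH_ESTCPU_MAX \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

static unsigned int
sketch_estcpu_clip(unsigned int estcpu)
{
	return (estcpu > SKETCH_ESTCPU_MAX ? SKETCH_ESTCPU_MAX : estcpu);
}

/* Priority offset above the base user priority (assumed formula). */
static unsigned int
sketch_pri_offset(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= SKETCH_PRIO_MAX);
	return (sketch_estcpu_clip(estcpu) + SKETCH_NICE_WEIGHT * nice);
}

/* Run queue index for a given offset. */
static unsigned int
sketch_queue(unsigned int offset)
{
	return (offset / SKETCH_PPQ);
}
```

Under these assumptions a compute-bound nice +0 process tops out at offset 36, one queue below an idle nice +20 process at offset 40.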


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.161 24-Jan-2020 cheloha

*sleep_nsec(9): log process name and pid when nsecs == 0

We included DIAGNOSTIC in *sleep_nsec(9) when they were first committed
to help us sniff out division-to-zero bugs when converting *sleep(9)
callers to the new interfaces.

Recently we exposed the new interface to userland callers. This has
yielded some warnings.

This diff adds a process name and pid to the warnings to help determine
the source of the zero-length sleeps.

ok mpi@


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
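
A minimal sketch of the point made above, assuming a hypothetical
helper timeout_to_ticks() that is not part of the committed code: the
operands are converted to uint64_t before any arithmetic happens, so no
signed overflow can occur, and a result that would not fit saturates.
The saturation policy is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a (sec, nsec) interval into ticks.
 * The casts to uint64_t come first, so the multiplication is done in
 * an unsigned type and cannot trigger signed-overflow undefined
 * behaviour.
 */
static uint64_t
timeout_to_ticks(int64_t sec, long nsec, unsigned int hz)
{
	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L && hz > 0);

	/* Assumed policy: saturate instead of wrapping around. */
	if ((uint64_t)sec > UINT64_MAX / hz - 1)
		return UINT64_MAX;

	return (uint64_t)sec * hz +
	    ((uint64_t)nsec * hz) / 1000000000ULL;
}
```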


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.
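
The "atomic inc/dec" part of such a wrapper could look roughly like the
sketch below. It is a userland approximation using C11 atomics, every
sketch_* name is hypothetical, and the sleeping done by refcnt_finalise
is deliberately left out because it depends on the kernel sleep
machinery.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's reference count type. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference; the count must already be non-zero. */
static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);
}

/* Drop a reference; returns 1 if this was the last one. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* catch underflow */
	return old == 1;
}
```

The caller that sees a return value of 1 from the release function is
the one responsible for freeing the object.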

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.
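
The pattern can be illustrated with a mock, assuming a simplified
stand-in for struct timeout. mock_timeout, mock_timeout_del() and
cancel_sleep_timeout() are hypothetical names; the real timeout_del()
does its check and removal under the timeout subsystem's own locking,
which this single-threaded sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock stand-in for the kernel's struct timeout (illustration only). */
struct mock_timeout {
	bool pending;
};

/* Cancel the timeout and report whether it was still pending. */
static int
mock_timeout_del(struct mock_timeout *to)
{
	int was_pending = to->pending;

	to->pending = false;
	return was_pending;
}

/* Returns 1 if the timeout was cancelled before it could fire. */
static int
cancel_sleep_timeout(struct mock_timeout *to)
{
	/*
	 * Racy pattern being removed:
	 *	if (timeout_pending(to))
	 *		timeout_del(to);
	 * The timeout may fire between the check and the delete, so
	 * decide on the return value of the delete itself.
	 */
	return mock_timeout_del(to);
}
```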

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into wall-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
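
The clipping rule from the "nice bug" paragraph can be illustrated with a small C sketch. All constant values below (a nice weight of 2, PRIO_MAX of 20, PPQ of 4, PUSER of 50) and the simplified priority formula are assumptions chosen for illustration; only the bound NICE_WEIGHT * PRIO_MAX - PPQ comes from the message itself. The SK_ names are hypothetical.

```c
/* Illustrative constants; the values are assumptions, not taken from the log. */
#define SK_NICE_WEIGHT	2	/* priority levels per nice step */
#define SK_PRIO_MAX	20	/* largest nice value */
#define SK_PPQ		4	/* priority levels sharing one run queue */
#define SK_PUSER	50	/* base user priority */

/* Upper bound on the CPU estimate, as stated in the commit message. */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the decayed CPU usage estimate to the limit. */
static unsigned int
sk_clip_estcpu(unsigned int estcpu)
{
	return estcpu < SK_ESTCPU_LIMIT ? estcpu : SK_ESTCPU_LIMIT;
}

/*
 * Simplified priority formula (an assumption of this sketch):
 * larger values mean lower scheduling priority. nice is 0..SK_PRIO_MAX.
 */
static unsigned int
sk_user_priority(unsigned int estcpu, int nice)
{
	return SK_PUSER + sk_clip_estcpu(estcpu) +
	    (unsigned int)(SK_NICE_WEIGHT * nice);
}
```

With these assumed values a compute-bound nice +0 process tops out one run queue ahead of an idle nice +20 process, which is the separation the clipping is meant to guarantee.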


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.160 21-Jan-2020 mpi

Import dt(4) a driver and framework for Dynamic Profiling.

The design is fairly simple: events, in the form of descriptors on a
ring, are being produced in any kernel context and being consumed by
a userland process reading /dev/dt.

Code and hooks are all guarded under '#if NDT > 0' so this commit
shouldn't introduce any change as long as dt(4) is disabled in GENERIC.

ok kettenis@, visa@, jasper@, deraadt@


# 1.159 21-Jan-2020 visa

Make __thrsleep(2) and __thrwakeup(2) MP-safe

Threads in __thrsleep(2) are tracked using queues, one queue per each
process for synchronization between threads of a process, and one
system-wide queue for the special ident -1 handling. Each of these
queues has an associated rwlock that serializes access.

The queue lock is released when calling copyin() and copyout() in
thrsleep(). This preserves the existing behaviour where a blocked copy
operation does not prevent other threads from making progress.

Tested by anton@, claudio@
OK anton@, claudio@, tedu@, mpi@


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows us to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
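
A rough userland sketch of the timeout semantics described above: the sentinel means "no timeout", and a zero-nanosecond request still sleeps until the next tick. The helper name sk_nsecs_to_ticks(), the SK_ constants and the tick rate of 100 Hz are assumptions for illustration, not the kernel's code; the convention that a tick count of 0 means "no timeout" follows tsleep(9).

```c
#include <limits.h>
#include <stdint.h>

#define SK_INFSLP		UINT64_MAX	/* "no timeout" sentinel */
#define SK_HZ			100		/* assumed tick rate */
#define SK_NSEC_PER_TICK	(1000000000ULL / SK_HZ)

/*
 * Convert a nanosecond timeout to a tick count.
 * Returns 0 for "no timeout", otherwise a count of at least 1.
 */
static int
sk_nsecs_to_ticks(uint64_t nsecs)
{
	uint64_t ticks;

	if (nsecs == SK_INFSLP)
		return 0;		/* no timeout requested */

	/* Round up so a short sleep never becomes zero ticks. */
	ticks = nsecs / SK_NSEC_PER_TICK;
	if (nsecs % SK_NSEC_PER_TICK != 0)
		ticks++;
	if (ticks == 0)
		ticks = 1;		/* zero nsecs: sleep until next tick */
	if (ticks > INT_MAX)
		ticks = INT_MAX;	/* clamp to the tick counter's range */
	return (int)ticks;
}
```

Using UINT64_MAX rather than zero as the sentinel keeps "sleep forever" and "sleep as briefly as possible" distinguishable at the call site.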


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
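
The error selection described above amounts to a single decision, sketched here in C. The function name sk_interrupted_errno() and its boolean parameter are hypothetical; the real code derives the restart flag from the signal state.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Pick the errno for an interrupted sleep: ECANCELED when the
 * interrupting signal was installed with SA_RESTART, EINTR otherwise.
 */
static int
sk_interrupted_errno(bool sa_restart)
{
	return sa_restart ? ECANCELED : EINTR;
}
```

Distinct values let userland tell whether the call should be retried or the interruption reported to the caller.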


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
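
The point about starting with the right type can be sketched as follows. This is an illustration under assumptions: struct sk_timespec, sk_timespec_to_ticks() and the 100 Hz tick rate are hypothetical, and the explicit range check before multiplying is an addition of the sketch rather than something the message describes.

```c
#include <limits.h>
#include <stdint.h>

#define SK_HZ		100			/* assumed tick rate */
#define SK_TICK_NSEC	(1000000000 / SK_HZ)	/* nanoseconds per tick */

/* Hypothetical stand-in for struct timespec. */
struct sk_timespec {
	int64_t tv_sec;
	long tv_nsec;
};

/*
 * Convert a relative timeout to ticks without signed overflow.
 * Returns -1 for an invalid timespec and clamps to INT_MAX.
 */
static int
sk_timespec_to_ticks(const struct sk_timespec *ts)
{
	uint64_t to_ticks;

	if (ts->tv_sec < 0 || ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
		return -1;

	/* Reject values that cannot fit before doing any arithmetic. */
	if ((uint64_t)ts->tv_sec > (uint64_t)INT_MAX / SK_HZ)
		return INT_MAX;

	/* Every operand is uint64_t, so no signed arithmetic can overflow. */
	to_ticks = (uint64_t)ts->tv_sec * SK_HZ +
	    (uint64_t)ts->tv_nsec / SK_TICK_NSEC;
	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

Casting the operands, rather than the result, is what matters: a signed multiplication that overflows is undefined before the assignment to an unsigned variable ever happens.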


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
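
The "atomic inc/dec" core of such a wrapper might look like the following userland sketch using C11 atomics. The sk_ names are hypothetical and this is not the kernel API; the part where the finalising thread sleeps until the last reference is released is deliberately left out, since it depends on the kernel sleep machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; starts with one reference held. */
struct sk_refcnt {
	atomic_uint refs;
};

static void
sk_refcnt_init(struct sk_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
sk_refcnt_take(struct sk_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns true if it was the last one. */
static bool
sk_refcnt_rele(struct sk_refcnt *r)
{
	/* atomic_fetch_sub returns the value before the decrement. */
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

Whoever sees true from the release call knows no other holder remains, which is the moment a sleeping finaliser would be woken.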


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wake up.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
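
A minimal sketch, in C, of the clipping rule described under "nice bug"
above. NICE_WEIGHT = 2 and the bound NICE_WEIGHT * PRIO_MAX - PPQ come from
the message; the values PRIO_MAX = 20 and PPQ = 4, and the function names
clip_estcpu() and prio_offset(), are assumptions made for illustration and
are not taken from the kernel source.

```c
#define NICE_WEIGHT	2	/* from the commit message */
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

/* Upper bound on p_estcpu named in the commit message. */
#define ESTCPU_MAX	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the decayed CPU usage estimate to the bound. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu;
}

/*
 * Hypothetical user-priority offset: unscaled estcpu plus weighted nice.
 * `nice' is counted from 0 (default) to PRIO_MAX (nice +20).
 */
static unsigned int
prio_offset(unsigned int estcpu, unsigned int nice)
{
	return clip_estcpu(estcpu) + NICE_WEIGHT * nice;
}
```

Under these assumptions a compute-bound nice +0 process tops out one full
queue (PPQ priorities) below an idle nice +20 process, which is the
separation the message asks for.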


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.158 16-Jan-2020 mpi

Introduce wakeup_proc() a function to un-SSTOP/SSLEEP a thread.

This moves most of the SCHED_LOCK() related to protecting the sleepqueue
and its states to kern/kern_synch.c

Name suggestion from jsg@, ok kettenis@, visa@


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
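
A minimal sketch, in C, of the kind of "sleep at least this long" conversion
described above; it is not the kernel's code. INFSLP as UINT64_MAX comes from
the tsleep_nsec(9) commit (1.150). The function name nsec_to_ticks(), the
tick_nsec parameter (nanoseconds per tick), and the clamp to INT_MAX are
assumptions made for illustration.

```c
#include <limits.h>
#include <stdint.h>

#define INFSLP	UINT64_MAX	/* "no timeout", per revision 1.150 */

/*
 * Convert a nanosecond interval to a tick count that never undershoots.
 * Returns 0 for INFSLP, meaning "do not set a timeout".
 */
static int
nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t to_ticks;

	if (nsecs == INFSLP)
		return 0;

	/* Round up without computing nsecs + tick_nsec - 1, which could wrap. */
	to_ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* One extra tick for the partially elapsed current tick. */
	to_ticks += 1;

	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;
	return (int)to_ticks;
}
```

With a 10 ms tick (tick_nsec = 10000000), a request for exactly 10 ms becomes
two ticks in this sketch: one for the interval and one for the tick already
in progress.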


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
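
A minimal sketch, in C, of the point made above: the widening cast has to be
applied to an operand before the multiplication, so that the arithmetic
itself is done in uint64_t. The function name ts_to_ticks(), the saturation
to UINT64_MAX, and the requirement that the caller has already validated the
timespec (tv_sec >= 0, 0 <= tv_nsec < 1000000000, hz > 0) are assumptions
made for illustration, not the kernel's code.

```c
#include <stdint.h>
#include <time.h>

/*
 * Convert a validated, non-negative timespec to ticks at `hz' ticks per
 * second, saturating instead of overflowing.
 */
static uint64_t
ts_to_ticks(const struct timespec *ts, int hz)
{
	uint64_t uhz = (uint64_t)hz;
	uint64_t t;

	/* Saturate if tv_sec * hz plus the sub-second part cannot fit. */
	if ((uint64_t)ts->tv_sec > (UINT64_MAX - uhz) / uhz)
		return UINT64_MAX;

	/* Cast first: both products are computed in unsigned 64-bit. */
	t = (uint64_t)ts->tv_sec * uhz;
	t += (uint64_t)ts->tv_nsec * uhz / 1000000000ULL;
	return t;
}
```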


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
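
A minimal userland sketch, in C11, of the "atomic inc/dec" part of such a
reference count. It is not the kernel's refcnt implementation: the
sketch_refcnt names are hypothetical, C11 <stdatomic.h> stands in for the
kernel's atomic operations, and the sleeping done by refcnt_finalise is
deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel reference count. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/*
 * Release one reference. Returns true if this was the last one, which is
 * the point at which a sleeping finaliser would need to be woken.
 */
static bool
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```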


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.157 14-Jan-2020 mpi

Introduce TIMESPEC_TO_NSEC() and use it to convert userland facing
tsleep(9) to tsleep_nsec(9).

ok bluhm@


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@
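
The "at least" conversion described in this commit can be illustrated with a
minimal sketch. This is not the code that was committed: the function name
sketch_nsec_to_ticks() and the constant SKETCH_TICK_NSEC (a 100 Hz tick) are
assumptions made up for the example. It only shows the general idea of
rounding the interval up to whole ticks and then adding one more tick, since
an unknown part of the current tick has already elapsed.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical tick length for illustration: 10 ms, i.e. a 100 Hz clock. */
#define SKETCH_TICK_NSEC 10000000ULL

/*
 * Convert an interval in nanoseconds to a tick count that is guaranteed
 * to cover at least that interval. Illustrative only; not kernel code.
 */
static int
sketch_nsec_to_ticks(uint64_t nsecs)
{
	uint64_t ticks;

	/* Clamp so the rounding addition below cannot wrap around. */
	if (nsecs > UINT64_MAX - SKETCH_TICK_NSEC)
		nsecs = UINT64_MAX - SKETCH_TICK_NSEC;

	/* Round up to whole ticks, then add one for the partial current tick. */
	ticks = (nsecs + SKETCH_TICK_NSEC - 1) / SKETCH_TICK_NSEC + 1;

	/* The tick-based sleep interfaces take an int timeout. */
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

With these assumed values, a request of exactly one tick length yields two
ticks, and a request one nanosecond longer yields three.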


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
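
The point about types in this commit can be illustrated with a small sketch.
It is not the kernel's code: sketch_to_ticks() and SKETCH_HZ are names
invented for the example, and the interval is assumed to be non-negative.
The idea shown is that the arithmetic starts in uint64_t, rather than being
done in a signed type and assigned to an unsigned one afterwards, and that
the result is clamped before it is narrowed to int.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical tick rate for illustration only; not taken from the kernel. */
#define SKETCH_HZ 100

/*
 * Convert a non-negative (sec, nsec) interval to a tick count.
 * Casting the first operand to uint64_t makes the whole expression
 * unsigned 64-bit, so no signed overflow can occur along the way.
 */
static int
sketch_to_ticks(long long sec, long nsec)
{
	uint64_t ticks;

	/* Reject huge second counts up front so the multiply cannot wrap. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / SKETCH_HZ)
		return INT_MAX;

	ticks = (uint64_t)sec * SKETCH_HZ +
	    (uint64_t)nsec / (1000000000ULL / SKETCH_HZ);

	/* Clamp before narrowing to the int used by tick-based timeouts. */
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```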


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
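
As a rough userland illustration of the "atomic inc/dec" part of this
wrapper, the sketch below uses C11 atomics. The sketch_refcnt names are
hypothetical and are not the kernel API; the sleeping behaviour that the
commit attributes to refcnt_finalise is deliberately left out, since it
depends on the kernel sleep machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a reference count wrapper. */
struct sketch_refcnt {
	atomic_uint refs;
};

/* Start with one reference, held by the creator. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Release a reference; returns true if it was the last one. */
static bool
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

The caller that sees true from sketch_refcnt_rele() is the one responsible
for tearing the object down.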


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
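
The rounding rule in this entry can be sketched in plain C. This is a userland illustration under stated assumptions, not the kernel's code: the name nsec_to_ticks and its parameters are hypothetical, and the tick length is passed in as nanoseconds per tick.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a sleep interval in nanoseconds into a
 * tick count that covers at least that interval.
 * nsec_per_tick is the length of one clock tick (10000000 at 100 Hz).
 */
static int
nsec_to_ticks(uint64_t nsecs, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	assert(nsec_per_tick > 0);

	/* Round up: a partial tick still needs a whole tick to elapse. */
	ticks = nsecs / nsec_per_tick;
	if (nsecs % nsec_per_tick != 0)
		ticks++;

	/* Clamp before the final increment so the result fits in an int. */
	if (ticks >= INT_MAX)
		return INT_MAX;

	/* Add one: the tick we are currently in is already partly used up. */
	return (int)ticks + 1;
}
```

With a 10 ms tick, a request for exactly 10 ms yields 2 ticks rather than 1, because the first tick boundary may arrive almost immediately.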


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
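
The clipping point made in the "nice bug" paragraph of this entry can be illustrated with a small userland sketch. It is not the scheduler's code: the function names are hypothetical, and the constant values below are assumptions chosen for illustration (the message itself only gives the nice weight of two and the limit formula NICE_WEIGHT * PRIO_MAX - PPQ).

```c
#include <assert.h>
#include <limits.h>

/* Assumed values for illustration; check the tree you are reading. */
#define NICE_WEIGHT	2	/* priority levels per nice level */
#define PRIO_MAX	20	/* largest nice value */
#define PPQ		4	/* priorities per run queue (assumption) */

#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Old behaviour as described in the message: a value that already fits
 * in an unsigned char is clipped to UCHAR_MAX, so the clip never fires.
 */
static unsigned char
clip_old(unsigned char estcpu)
{
	int v = estcpu;

	if (v > UCHAR_MAX)	/* can never be true for an unsigned char */
		v = UCHAR_MAX;
	return (unsigned char)v;
}

/* Clip to the limit the message says should have been used instead. */
static unsigned int
clip_new(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the limit works out to 36, so a compute-bound process with a large p_estcpu is held below the range reserved for nice +20 processes, while clip_old() returns every input unchanged.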


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.156 12-Jan-2020 cheloha

*sleep_nsec(9): sleep *at least* the given number of nanoseconds

The *sleep(9) interfaces are challenging to use when one needs to sleep
for a given minimum duration: the programmer needs to account for both
the current tick and any integer division when converting an interval
to a count of ticks. This sort of input conversion is complicated and
ugly at best and error-prone at worst.

This patch consolidates this conversion logic into the *sleep_nsec(9)
functions themselves. This will allow us to use the functions at the
syscall layer and elsewhere in the kernel where guaranteeing a minimum
sleep duration is of vital importance.

With input from bluhm@, guenther@, ratchov@, tedu@, and kettenis@.

Requested by mpi@ and kettenis@.

Conversion algorithm from mpi@.

ok mpi@, kettenis@, deraadt@


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
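
As an illustration of the point about starting with the right type, here is a
minimal sketch, not the committed code. The helper name secs_to_ticks_clamped()
and its clamping policy are assumptions made for this example. The comparison
and the multiplication are done in uint64_t, so no signed overflow can occur,
and the result saturates at INT_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale seconds to ticks without signed overflow.
 * The operands are converted to uint64_t before any arithmetic, and
 * values that would not fit in an int saturate at INT_MAX.
 */
static int
secs_to_ticks_clamped(int64_t sec, int hz)
{
	uint64_t ticks;

	assert(sec >= 0 && hz > 0);

	/* Would sec * hz exceed INT_MAX?  Checked without multiplying. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	ticks = (uint64_t)sec * (uint64_t)hz;	/* cannot wrap here */
	return (int)ticks;
}
```

With hz = 100, one second maps to 100 ticks, while an absurdly large value
such as INT64_MAX seconds is reported as INT_MAX instead of wrapping.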


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
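
The race can be sketched with a userland toy model. This is an illustration
only: struct toy_timeout and the toy_timeout_*() functions are hypothetical
names, and a C11 atomic exchange stands in for whatever serialization the
kernel actually uses. The point is that the "was it pending?" test and the
removal happen as one step, and the caller branches on the return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a kernel timeout; it only models the "pending" bit. */
struct toy_timeout {
	atomic_bool pending;
};

/* Mark the toy timeout as scheduled. */
static void
toy_timeout_add(struct toy_timeout *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, true);
}

/*
 * Cancel the toy timeout and report whether this call removed it.
 * The test and the clear are a single atomic exchange, so there is no
 * window between "is it pending?" and "delete it".
 */
static bool
toy_timeout_del(struct toy_timeout *to)
{
	assert(to != NULL);
	return atomic_exchange(&to->pending, false);
}
```

A caller would then write "if (toy_timeout_del(&to)) { ... }" rather than
checking a pending flag first and deleting afterwards.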


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
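
The rounding rule can be sketched as follows. This is a minimal example under
stated assumptions, not the kernel's code: timeout_to_ticks() is a hypothetical
name, hz is passed in as a parameter, and the inputs are assumed small enough
that the sum does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout to clock ticks.
 * Any partial tick is rounded up, then one more tick is added to
 * cover the tick that is already in progress when the sleep starts.
 */
static uint64_t
timeout_to_ticks(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t tick_nsec;

	assert(hz > 0 && hz <= 1000000000ULL && nsec < 1000000000ULL);
	tick_nsec = 1000000000ULL / hz;		/* length of one tick */

	return sec * hz					/* whole seconds */
	    + (nsec + tick_nsec - 1) / tick_nsec	/* round partial tick up */
	    + 1;					/* tick already under way */
}
```

With hz = 100, a request of one nanosecond yields two ticks: one from rounding
up and one for the current tick, so the sleep can never end early.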


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
  very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
  where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
  There could be a thundering herd effect if we call wakeup from an
  interrupt handler, so subtract cpus with queued processes when
  deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
  runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
  guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
  a much simpler function: cpu_switchto. Instead of having the locore
  code walk the run queues, let the MI code choose the process we
  want to run and only implement the context switching itself in MD
  code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
  in MD code, implement one idle proc for each cpu. make the idle
  loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
  wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
  steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
  platform hz values don't cause nice +0 processes to look like they are
  niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
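
The clipping rule from the "nice bug" paragraph can be sketched like this.
It is an illustration, not the scheduler's code. Only the nice weight of two
and the bound NICE_WEIGHT * PRIO_MAX - PPQ come from the text above. The
values PRIO_MAX = 20 and PPQ = 4, the function names, and the simplified
nice range of 0..20 are assumptions.

```c
#include <assert.h>

#define NICE_WEIGHT	2	/* priority points per nice level (from the text) */
#define PRIO_MAX	20	/* largest nice value (assumed) */
#define PPQ		4	/* priorities per run queue (assumed) */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Keep the decayed CPU estimate at or below the limit. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/*
 * Hypothetical priority offset above the base user priority:
 * unscaled (clipped) estcpu plus the weighted nice value.
 */
static unsigned int
user_priority_offset(unsigned int estcpu, int nice)
{
	assert(nice >= 0 && nice <= PRIO_MAX);
	return clip_estcpu(estcpu) + NICE_WEIGHT * (unsigned int)nice;
}
```

Under these assumed values a nice +0 process that has burned a lot of CPU
tops out at offset 36 (queue 9), while an idle nice +20 process sits at
offset 40 (queue 10), so the first can no longer penalize itself onto the
queue of the second.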


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.155 30-Nov-2019 visa

Move kernel locking inside the sleep machinery. This enables calling
rwsleep(9) with PCATCH and rw_enter(9) with RW_INTR without the kernel
lock. In addition, now tsleep(9) with PCATCH should be safe to use
without the kernel lock if the sleep is purely time-based.

Tested by anton@, cheloha@, chris@
OK anton@, cheloha@


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
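
A minimal sketch of the timeout semantics described in 1.150, written as userland C rather than kernel code. The names SKETCH_INFSLP, sketch_msec_to_nsec() and sketch_timeout_ticks() are hypothetical stand-ins, not the sys/time.h or kern_synch.c interfaces. The saturating conversion and the tick length passed by the caller are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for INFSLP: UINT64_MAX (not zero) means "set no timeout". */
#define SKETCH_INFSLP	UINT64_MAX

/*
 * Hypothetical converter: milliseconds to nanoseconds.  It saturates
 * instead of wrapping, so an absurdly long timeout degrades to
 * SKETCH_INFSLP rather than to a short one.
 */
static uint64_t
sketch_msec_to_nsec(uint64_t msec)
{
	if (msec > UINT64_MAX / 1000000ULL)
		return UINT64_MAX;
	return msec * 1000000ULL;
}

/*
 * Hypothetical helper: turn a nanosecond timeout into a tick count.
 * Returns 0 for "no timeout", otherwise at least 1 tick.
 */
static uint64_t
sketch_timeout_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	if (nsecs == SKETCH_INFSLP)
		return 0;

	/* Round up to whole ticks. */
	ticks = nsecs / tick_nsec;
	if (nsecs % tick_nsec != 0)
		ticks++;

	/* Zero nanoseconds is not strictly valid; still sleep one tick. */
	if (ticks == 0)
		ticks = 1;
	return ticks;
}
```

With an assumed 10 ms tick (tick_nsec = 10000000), a 25 ms timeout rounds up to 3 ticks, a zero timeout still yields 1 tick, and SKETCH_INFSLP yields 0, meaning no timeout is armed.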


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
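
The point made in 1.133 is that the arithmetic itself must be carried out in an unsigned 64-bit type, and that the result still has to be clamped before it is narrowed. A sketch under stated assumptions: sketch_timeout_to_ticks() is a hypothetical helper, not the code that was committed, and clamping to INT_MAX is an illustrative choice.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds plus nanoseconds to ticks.
 * All operands are uint64_t from the start, so no signed overflow
 * (undefined behaviour) can occur, and the result is clamped before
 * being narrowed to int.
 */
static int
sketch_timeout_to_ticks(uint64_t sec, uint64_t nsec, uint64_t hz)
{
	uint64_t tick_nsec, ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	assert(nsec < 1000000000ULL);

	tick_nsec = 1000000000ULL / hz;

	/* Clamp first so that sec * hz cannot wrap around. */
	if (sec > (uint64_t)INT_MAX / hz)
		return INT_MAX;

	/* Round the sub-second part up to a whole tick. */
	ticks = sec * hz + (nsec + tick_nsec - 1) / tick_nsec;
	if (ticks > (uint64_t)INT_MAX)
		return INT_MAX;
	return (int)ticks;
}
```

Assigning the result of a signed computation to an unsigned variable would not help here, because the overflow has already happened by the time the assignment is made.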


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
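
The counting half of the wrapper introduced in 1.123, together with the underflow and overflow checks added in 1.127 and 1.128, can be sketched in userland with C11 atomics. This is an illustration only: the sketch_ names are hypothetical, assert() stands in for KASSERT/DIAGNOSTIC checks, and the sleeping part done by refcnt_finalise is deliberately left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference held by its creator. */
static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);		/* object was already released */
	assert(old != UINT_MAX);	/* counter overflow */
}

/* Returns 1 when the caller dropped the last reference, else 0. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);		/* counter underflow */
	return old == 1;
}
```

The caller that sees a return value of 1 is the one responsible for freeing the object, or for waking whoever is waiting for the count to drain.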


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
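
The race removed in 1.112 is a check-then-act problem: between the timeout_pending() test and the timeout_del() call the timeout may fire. A minimal userland sketch, assuming a C11 atomic flag in place of the kernel's timeout state; struct sketch_timeout and the sketch_ functions are hypothetical and do not reproduce the timeout(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_timeout {
	atomic_int pending;	/* 1 while the timeout is scheduled */
};

/* Schedule the (hypothetical) timeout. */
static void
sketch_timeout_add(struct sketch_timeout *to)
{
	assert(to != NULL);
	atomic_store(&to->pending, 1);
}

/*
 * Cancel the timeout and report what was done in one atomic step.
 * Returns 1 if a pending timeout was cancelled, 0 if there was
 * nothing to cancel.  No separate "is it pending?" test is needed,
 * so there is no window between the check and the cancellation.
 */
static int
sketch_timeout_del(struct sketch_timeout *to)
{
	assert(to != NULL);
	return atomic_exchange(&to->pending, 0);
}
```

Callers that used to branch on the pending test can branch on the return value of the delete call instead.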


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes.
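
The bounded wakeup described in 1.31 can be sketched over a plain singly linked sleep queue. Everything here is an assumption made for illustration: struct sketch_proc, its fields and sketch_wakeup_n() are hypothetical, and the kernel's real sleep queues, locking and run-queue handling are not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_proc {
	const void		*wchan;		/* channel slept on */
	int			 runnable;	/* set once woken */
	struct sketch_proc	*next;		/* next sleeper in queue */
};

/*
 * Wake at most n sleepers waiting on ident, in queue order.
 * Woken entries are unlinked from the queue.  Returns the number
 * of sleepers actually woken.
 */
static int
sketch_wakeup_n(struct sketch_proc **headp, const void *ident, int n)
{
	struct sketch_proc **pp = headp, *p;
	int woken = 0;

	assert(headp != NULL);

	while ((p = *pp) != NULL && woken < n) {
		if (p->wchan == ident) {
			*pp = p->next;		/* unlink from the queue */
			p->next = NULL;
			p->wchan = NULL;
			p->runnable = 1;
			woken++;
		} else {
			pp = &p->next;		/* keep it, move on */
		}
	}
	return woken;
}
```

In this sketch a wakeup_one-style call is simply n = 1; the loop stops as soon as the budget is used up, so the rest of the queue is never visited.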


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
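
The clipping rule in the "nice bug" paragraph can be checked with a small worked example. This is a sketch under stated assumptions: it reads "use p_estcpu without scaling" as offset = p_estcpu + NICE_WEIGHT * nice, takes PPQ (priorities per queue) to be 4, and uses hypothetical SK_-prefixed names throughout. None of it is copied from the kernel.

```c
#include <assert.h>

#define SK_NICE_WEIGHT	2	/* "nice weight of two" from the message */
#define SK_PRIO_MAX	20	/* largest nice value */
#define SK_PPQ		4	/* priorities per run queue: assumed */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the CPU-usage estimate as the message requires. */
static unsigned
sk_clip_estcpu(unsigned estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}

/* Priority offset above the base user priority (assumed formula). */
static unsigned
sk_pri_offset(unsigned estcpu, int nice)
{
	assert(nice >= 0 && nice <= SK_PRIO_MAX);
	return sk_clip_estcpu(estcpu) + SK_NICE_WEIGHT * (unsigned)nice;
}

/* Run queue index for a given offset. */
static unsigned
sk_queue_of(unsigned offset)
{
	return offset / SK_PPQ;
}
```

Under these assumptions a CPU-bound process at nice 0 tops out one queue below an idle process at nice +20, which is the separation the message asks for.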


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.154 12-Nov-2019 visa

Check sleep timeout state only if the sleep has a timeout. Otherwise,
the timeout cancellation in sleep_finish_timeout() would acquire the
kernel lock every time in the no-timeout case, as noticed by mpi@.

This also reduces the contention of timeout_mutex.

OK mpi@, feedback guenther@


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
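
The overflow concern above can be illustrated with a userland sketch of a seconds-plus-nanoseconds to ticks conversion. ts_to_ticks_sketch() and its parameters are hypothetical, and saturating at INT_MAX is a choice made for the sketch, not necessarily what the kernel does.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert sec + nsec to clock ticks at hz ticks per second, rounding up.
 * All arithmetic starts in uint64_t, so no signed intermediate can
 * overflow; results that do not fit in an int saturate at INT_MAX.
 */
static int
ts_to_ticks_sketch(int64_t sec, long nsec, unsigned hz)
{
	uint64_t tick_nsec, ticks;

	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000000000u);

	/* Reject values whose seconds part alone would exceed INT_MAX. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / hz)
		return INT_MAX;

	tick_nsec = 1000000000ULL / hz;
	ticks = (uint64_t)sec * hz +
	    ((uint64_t)nsec + tick_nsec - 1) / tick_nsec;
	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```

The early range check matters as much as the unsigned type: unsigned wraparound is well defined, but it would still yield a wrong tick count.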


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
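
The "atomic inc/dec" core of such a wrapper can be modelled in userland with C11 atomics. The rc_* names and struct refcnt_sketch are hypothetical, and the sleeping part of refcnt_finalise is deliberately left out, so this is a sketch of the counting only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userland model of a reference count. */
struct refcnt_sketch {
	atomic_uint refs;
};

/* The creator of the object holds the first reference. */
static void
rc_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

static void
rc_take(struct refcnt_sketch *r)
{
	unsigned old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* taking a reference on a dead object is a bug */
}

/* Drop a reference; returns 1 if the caller dropped the last one. */
static int
rc_rele(struct refcnt_sketch *r)
{
	unsigned old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow check */
	return old == 1;
}
```

A caller that sees rc_rele() return 1 knows no other reference remains and may free the object.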


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
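
Why the check-then-delete pattern is racy, and how a delete that reports its result avoids it, can be sketched with a single atomic flag. struct timeout_sketch, to_add() and to_del() are hypothetical userland stand-ins, not the timeout(9) implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one flag says whether the timeout is scheduled. */
struct timeout_sketch {
	atomic_bool pending;
};

static void
to_add(struct timeout_sketch *t)
{
	assert(t != NULL);
	atomic_store(&t->pending, true);
}

/*
 * Cancel and report in one atomic step: returns true only if this call
 * removed a pending timeout.  A separate "is it pending?" test followed
 * by a delete leaves a window in which the timeout can fire in between.
 */
static bool
to_del(struct timeout_sketch *t)
{
	assert(t != NULL);
	return atomic_exchange(&t->pending, false);
}
```

Callers branch on the return value of to_del() instead of testing the pending state first.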


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
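
The difference between a plain read-modify-write on a flags word and an atomic one can be sketched with C11 atomics. setbits_sketch(), clearbits_sketch() and the F_SKETCH_* flags are hypothetical userland names, not the kernel's atomic_{set,clear}bits_int.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define F_SKETCH_A	0x1u	/* hypothetical flag bits */
#define F_SKETCH_B	0x2u

/* Set bits in *p as one indivisible read-modify-write. */
static void
setbits_sketch(atomic_uint *p, unsigned bits)
{
	assert(p != NULL);
	atomic_fetch_or(p, bits);
}

/* Clear bits in *p as one indivisible read-modify-write. */
static void
clearbits_sketch(atomic_uint *p, unsigned bits)
{
	assert(p != NULL);
	atomic_fetch_and(p, ~bits);
}
```

A plain "flags |= bit" compiles to a separate load and store, so an interrupt that updates the same word between the two can have its change overwritten; the atomic forms close that window.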


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
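
The clipping rule in the "nice bug" paragraph can be checked with a small
worked example. This is only a sketch, not the scheduler's code: the EX_
names are made up, PPQ = 4 is an assumption, and the priority is assumed to
be the unscaled p_estcpu plus the nice weight times nice.

```c
#include <assert.h>

/* Illustrative constants; EX_PPQ = 4 is an assumption, not from the log. */
#define EX_NICE_WEIGHT	2	/* "nice weight of two" */
#define EX_PRIO_MAX	20	/* largest nice value */
#define EX_PPQ		4	/* priorities per run queue (assumed) */
#define EX_ESTCPU_LIMIT	(EX_NICE_WEIGHT * EX_PRIO_MAX - EX_PPQ)

/* Clip the decayed CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned int
ex_clip_estcpu(unsigned int estcpu)
{
	return estcpu > EX_ESTCPU_LIMIT ? EX_ESTCPU_LIMIT : estcpu;
}

/* Queue index for a process, relative to the first user queue. */
static unsigned int
ex_queue(unsigned int estcpu, unsigned int nice)
{
	return (ex_clip_estcpu(estcpu) + EX_NICE_WEIGHT * nice) / EX_PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out one queue
above an idle nice +20 process; clipping at NICE_WEIGHT * PRIO_MAX alone
would put both on the same queue.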


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.153 15-Oct-2019 mpi

Reduce the number of places where `p_priority' and `p_stat' are set.

This refactoring will help future scheduler locking, in particular to
shrink the SCHED_LOCK().

No intended behavior change.

ok visa@


Revision tags: OPENBSD_6_6_BASE
# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This allows us to enforce that sleeping priorities will now always be <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
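
A userland sketch of the conventions described above, with hypothetical EX_
names standing in for the kernel's own definitions: seconds are converted to
nanoseconds with saturation, the all-ones value means "no timeout", and zero
is treated as a separate, suspicious case.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the "do not set a timeout" value, UINT64_MAX as in the log. */
#define EX_INFSLP	UINT64_MAX

enum ex_timeout_kind {
	EX_TIMEOUT_NONE,	/* sleep until woken, no timeout armed */
	EX_TIMEOUT_ZERO,	/* not strictly valid; a kernel would warn */
	EX_TIMEOUT_FINITE	/* ordinary bounded sleep */
};

/* Convert seconds to nanoseconds, saturating instead of overflowing. */
static uint64_t
ex_sec_to_nsec(uint64_t secs)
{
	if (secs > UINT64_MAX / 1000000000ULL)
		return UINT64_MAX;
	return secs * 1000000000ULL;
}

/* Classify a nanosecond timeout the way the commit message describes. */
static enum ex_timeout_kind
ex_classify(uint64_t nsecs)
{
	if (nsecs == EX_INFSLP)
		return EX_TIMEOUT_NONE;
	if (nsecs == 0)
		return EX_TIMEOUT_ZERO;
	return EX_TIMEOUT_FINITE;
}
```

Saturating at the same value that means "no timeout" is a design choice of
this sketch: an interval too large to represent simply never expires.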


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack: checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
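
A minimal userland sketch of the "atomic inc/dec" part of such a wrapper,
using C11 atomics. The ex_ names are hypothetical and this is not the
kernel's refcnt code; the sleeping done by refcnt_finalise is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference count; the creator holds the first reference. */
struct ex_refcnt {
	atomic_uint refs;
};

static void
ex_refcnt_init(struct ex_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an extra reference; the caller must already hold one. */
static void
ex_refcnt_take(struct ex_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* object was already released */
}

/* Drop a reference; returns true when this was the last one. */
static bool
ex_refcnt_rele(struct ex_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow */
	return old == 1;
}
```

The caller that sees ex_refcnt_rele() return true is the one responsible
for freeing the object.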


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
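
Why the check-then-delete pattern is racy can be sketched in userland with
C11 atomics. The ex_ names below are hypothetical and model only a pending
bit, not the kernel's timeout(9) machinery: the point is that "was it
pending?" and "cancel it" happen as one step whose result the caller tests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a timeout; only the pending bit is modelled. */
struct ex_timeout {
	atomic_bool pending;
};

/* Arm the timeout. */
static void
ex_timeout_add(struct ex_timeout *to)
{
	atomic_store(&to->pending, true);
}

/*
 * Cancel the timeout and report whether it was still pending.
 * The exchange is a single atomic step, so nothing can slip in
 * between the test and the cancellation.
 */
static bool
ex_timeout_del(struct ex_timeout *to)
{
	return atomic_exchange(&to->pending, false);
}
```

A caller then writes "if (ex_timeout_del(&to)) ..." instead of testing the
pending state first and deleting afterwards.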


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
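
The rounding rule can be illustrated with a small helper. This is a sketch
under assumptions, not the code from the commit: the function name is made
up, the timeout is taken in nanoseconds, and hz is passed as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds into
 * clock ticks for a clock running at hz ticks per second.
 */
static uint64_t
ex_nsec_to_ticks(uint64_t nsecs, uint64_t hz)
{
	uint64_t nsec_per_tick, ticks;

	assert(hz > 0 && hz <= 1000000000ULL);
	nsec_per_tick = 1000000000ULL / hz;

	/* Round up: a partial tick still costs a whole tick. */
	ticks = nsecs / nsec_per_tick + (nsecs % nsec_per_tick != 0);

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

With hz = 100, a timeout of exactly one tick (10 ms) becomes 2 ticks, so
the sleep cannot end early just because the current tick is nearly over.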


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
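
A userland sketch of a TAILQ-based sleep queue, to show why one pass is
enough: TAILQ_REMOVE unlinks an entry in constant time, so waking sleepers
never requires rescanning the list. The struct and function names are
hypothetical and this is not the kernel's wakeup() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeper record: what it waits on and whether it was woken. */
struct ex_sleeper {
	const void *wchan;
	int awake;
	TAILQ_ENTRY(ex_sleeper) link;
};
TAILQ_HEAD(ex_sleepq, ex_sleeper);

/* Wake up to n sleepers waiting on chan; returns how many were woken. */
static int
ex_wakeup_n(struct ex_sleepq *q, const void *chan, int n)
{
	struct ex_sleeper *s, *next;
	int woken = 0;

	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		/* Fetch the successor first; s may be unlinked below. */
		next = TAILQ_NEXT(s, link);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* O(1) unlink */
		s->wchan = NULL;
		s->awake = 1;
		woken++;
	}
	return woken;
}
```

Passing n = 1 gives the behaviour of a wake-one call, and a large n wakes
every sleeper on the channel.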


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
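
The idea of atomic bit operations on a flags word can be sketched in
userland with C11 atomics. The ex_ helpers below are hypothetical analogues
written for illustration; they are not the kernel's atomic.h functions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Set the given bits in *flags as one atomic read-modify-write. */
static void
ex_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear the given bits in *flags as one atomic read-modify-write. */
static void
ex_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain "flags |= bit" is a separate load, modify and store, so an
interrupt updating the same word in between can have its change lost; the
atomic forms avoid that.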


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.152 01-Oct-2019 cheloha

*sleep_nsec(9): add missing newlines to DIAGNOSTIC logs


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.
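
A minimal sketch of what such conversion helpers could look like, assuming
saturation on overflow is the desired behaviour. The toy_* names are
hypothetical stand-ins and not the actual sys/time.h interface; TOY_INFSLP
mirrors the INFSLP value described above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INFSLP	UINT64_MAX	/* mirrors INFSLP: "set no timeout" */

/* Seconds to nanoseconds, saturating instead of wrapping around. */
static inline uint64_t
toy_sec_to_nsec(uint64_t sec)
{
	if (sec > UINT64_MAX / 1000000000ULL)
		return UINT64_MAX;
	return sec * 1000000000ULL;
}

/* Milliseconds to nanoseconds, saturating instead of wrapping around. */
static inline uint64_t
toy_msec_to_nsec(uint64_t msec)
{
	if (msec > UINT64_MAX / 1000000ULL)
		return UINT64_MAX;
	return msec * 1000000ULL;
}

enum toy_sleep_kind { TOY_SLEEP_FOREVER, TOY_SLEEP_ZERO, TOY_SLEEP_TIMED };

/* Classify a nanosecond timeout the way the text above describes it. */
static enum toy_sleep_kind
toy_classify(uint64_t nsecs)
{
	if (nsecs == TOY_INFSLP)
		return TOY_SLEEP_FOREVER;	/* no timeout is armed */
	if (nsecs == 0)
		return TOY_SLEEP_ZERO;		/* the case that logs a warning */
	return TOY_SLEEP_TIMED;
}
```

With saturation, an absurdly large timeout collapses into the "sleep
forever" value rather than wrapping into a short one, which seems like the
safer failure mode for a sketch like this.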

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.
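
The conversion can be sketched as follows. This is an illustrative userland
version, not the code from the commit: the toy_* names and the tick_nsec
parameter (the tick length in nanoseconds, e.g. 10000000 for an assumed hz
of 100) are inventions for this example, and no overflow saturation is
attempted.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Convert a relative timeout to clock ticks, rounding up and adding one
 * extra tick for the tick that is already partly elapsed.
 */
static uint64_t
toy_timespec_to_ticks(const struct timespec *ts, uint64_t tick_nsec)
{
	uint64_t nsec, ticks;

	assert(tick_nsec > 0);
	assert(ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

	/* Do the arithmetic in a 64-bit unsigned type from the start. */
	nsec = (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;

	/* Round up: a partial tick still has to be waited out in full. */
	ticks = (nsec + tick_nsec - 1) / tick_nsec;

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}

/* Convenience wrapper so callers can pass seconds and nanoseconds. */
static uint64_t
toy_ticks_for(time_t sec, long nsec, uint64_t tick_nsec)
{
	struct timespec ts;

	ts.tv_sec = sec;
	ts.tv_nsec = nsec;
	return toy_timespec_to_ticks(&ts, tick_nsec);
}
```

Without the extra tick, a sleep requested late in the current tick could
end almost a full tick early, since the first tick boundary arrives before
a whole tick length has passed.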

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless P_NORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)
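
A minimal userland sketch of why the TAILQ conversion helps, assuming a single toy sleep queue rather than the kernel's actual structures: with doubly linked tail queues each removal is O(1), so one pass over the queue wakes every matching sleeper in O(n). The names struct sleeper, sleepq_insert and toy_wakeup_n are hypothetical and do not come from kern_synch.c.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a sleeping thread. */
struct sleeper {
	TAILQ_ENTRY(sleeper) link;	/* queue linkage, O(1) removal */
	const volatile void *wchan;	/* address the sleeper waits on */
	int awake;			/* set once the sleeper is woken */
};
TAILQ_HEAD(sleepq, sleeper);

/* Put a sleeper at the tail of the queue, waiting on wchan. */
static void
sleepq_insert(struct sleepq *q, struct sleeper *s, const volatile void *wchan)
{
	s->wchan = wchan;
	s->awake = 0;
	TAILQ_INSERT_TAIL(q, s, link);
}

/*
 * Wake at most n sleepers waiting on wchan in a single pass.
 * The successor is saved before removal so the walk stays valid.
 */
static int
toy_wakeup_n(struct sleepq *q, const volatile void *wchan, int n)
{
	struct sleeper *s, *next;
	int woken = 0;

	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		next = TAILQ_NEXT(s, link);
		if (s->wchan != wchan)
			continue;
		TAILQ_REMOVE(q, s, link);
		s->wchan = NULL;
		s->awake = 1;
		woken++;
	}
	return woken;
}
```

Passing n = 1 gives the behaviour of a wake-one call; a large n behaves like a plain wakeup on that channel.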

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for a longer time have been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions needs
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs, now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (roughly nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
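
The clipping rule described under "nice bug" can be sketched as follows. This is an illustration under stated assumptions, not the committed code: NICE_WEIGHT = 2 and PRIO_MAX = 20 follow the message above, while PPQ = 4, PUSER = 50, the maximum priority of 127 and the function names are assumptions chosen for the example. Larger numbers mean worse priority.

```c
/* Values taken from the commit message. */
#define TOY_NICE_WEIGHT	2
#define TOY_PRIO_MAX	20
/* Assumed values, for illustration only. */
#define TOY_PPQ		4	/* priorities per run queue */
#define TOY_PUSER	50	/* base user priority */
#define TOY_MAXPRI	127	/* worst possible priority */

/* Clip the decayed CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned int
toy_estcpu_clip(unsigned int estcpu)
{
	unsigned int lim = TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ;

	return estcpu < lim ? estcpu : lim;
}

/* User priority from unscaled estcpu plus the weighted nice value. */
static unsigned int
toy_user_priority(unsigned int estcpu, unsigned int nice)
{
	unsigned int pri;

	pri = TOY_PUSER + toy_estcpu_clip(estcpu) + TOY_NICE_WEIGHT * nice;
	return pri < TOY_MAXPRI ? pri : TOY_MAXPRI;
}

/* Run queue index for a given priority. */
static unsigned int
toy_run_queue(unsigned int pri)
{
	return pri / TOY_PPQ;
}
```

With these assumed constants a compute-bound nice +0 process is capped one run queue ahead of an idle nice +20 process, which is the property the clipping is meant to preserve.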


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.151 10-Jul-2019 mpi

Stop sleeping at PUSER.

This makes it possible to enforce that sleeping priorities are always <
PUSER.

ok visa@, ratchov@


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.
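
A hedged sketch of the timeout semantics described above, assuming the nanosecond value is ultimately converted to a tick count where 0 means "no timeout". The helper name toy_nsecs_to_timo, the TOY_INFSLP macro and the tick_nsec parameter are hypothetical; the real conversion in the kernel may differ.

```c
#include <limits.h>
#include <stdint.h>

/* Stand-in for INFSLP: UINT64_MAX, not zero, means "sleep forever". */
#define TOY_INFSLP	UINT64_MAX

/*
 * Convert a nanosecond timeout to ticks for a tick-based sleep.
 * tick_nsec is the length of one tick in nanoseconds.
 */
static int
toy_nsecs_to_timo(uint64_t nsecs, uint64_t tick_nsec)
{
	uint64_t to_ticks;

	if (nsecs == TOY_INFSLP)
		return 0;		/* no timeout is set */

	/* Round up without risking overflow in the addition. */
	to_ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);
	if (to_ticks == 0)
		to_ticks = 1;		/* zero nsecs still sleeps one tick */
	if (to_ticks > INT_MAX)
		to_ticks = INT_MAX;	/* clamp to the tick counter's range */
	return (int)to_ticks;
}
```

Keeping zero distinct from the "no timeout" sentinel is what leaves room for a zero-length sleep to become a poll later, as the message suggests.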

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to call the syscall again or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
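
A minimal sketch of the point made above, assuming a hypothetical helper toy_ts_to_ticks and a caller that has already rejected negative timespecs: the arithmetic starts in uint64_t, so it never relies on signed overflow, and the result is clamped before being narrowed to int.

```c
#include <limits.h>
#include <stdint.h>
#include <time.h>

/*
 * Convert a validated, non-negative timespec to ticks at hz ticks
 * per second, rounding the sub-second part up.
 */
static int
toy_ts_to_ticks(const struct timespec *ts, unsigned int hz)
{
	uint64_t nsec_per_tick = 1000000000ULL / hz;
	uint64_t ticks;

	/* Bail out early if the seconds part alone would not fit. */
	if ((uint64_t)ts->tv_sec >= (uint64_t)INT_MAX / hz)
		return INT_MAX;

	/* Cast first: every operand is uint64_t before multiplying. */
	ticks = (uint64_t)ts->tv_sec * hz;
	ticks += ((uint64_t)ts->tv_nsec + nsec_per_tick - 1) / nsec_per_tick;

	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```

Assigning the result of a signed multiplication to an unsigned variable would not help, because the overflow would already have happened in the signed type.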


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims have no
directly used symbols. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after asking guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless P_NORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
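
The clipping rule from the "nice bug" paragraph can be sketched as
below. This is only an illustration: the SKETCH_ names are made up for
this example, the nice weight of two comes from the text above, and the
values used for PRIO_MAX (20) and PPQ (4) are assumptions rather than
values quoted from the commit.

```c
#define SKETCH_NICE_WEIGHT	2	/* "nice weight of two" from the text */
#define SKETCH_PRIO_MAX		20	/* assumed: largest nice value */
#define SKETCH_PPQ		4	/* assumed: priorities per run queue */
#define SKETCH_ESTCPU_LIMIT \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/*
 * Clip the decayed CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ.
 * Clipping an unsigned char to UCHAR_MAX instead would never change
 * the value, which is the no-op the commit message points out.
 */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu > SKETCH_ESTCPU_LIMIT) ? SKETCH_ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the limit is 36, so a compute-bound
process cannot penalize itself past the queue used by nice +20 processes.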


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.150 03-Jul-2019 cheloha

Add tsleep_nsec(9), msleep_nsec(9), and rwsleep_nsec(9).

Equivalent to their unsuffixed counterparts except that (a) they take
a timeout in terms of nanoseconds, and (b) INFSLP, aka UINT64_MAX (not
zero) indicates that a timeout should not be set.

For now, zero nanoseconds is not a strictly valid invocation: we log a
warning on DIAGNOSTIC kernels if we see such a call. We still sleep
until the next tick in such a case, however. In the future this could
become some sort of poll... TBD.

To facilitate conversions to these interfaces: add inline conversion
functions to sys/time.h for turning your timeout into nanoseconds.

Also do a few easy conversions for warmup and to demonstrate how
further conversions should be done.

Lots of input from mpi@ and ratchov@. Additional input from tedu@,
deraadt@, mortimer@, millert@, and claudio@.

Partly inspired by FreeBSD r247787.

positive feedback from deraadt@, ok mpi@
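
The idea of a nanosecond timeout with a distinguished "no timeout"
value can be sketched in userland C as follows. The sketch_ names are
hypothetical and are not the functions added to sys/time.h; saturating
on overflow is an assumption of this example.

```c
#include <stdint.h>

#define SKETCH_INFSLP	UINT64_MAX	/* stand-in for INFSLP: no timeout */

/* Hypothetical conversion: seconds to nanoseconds, saturating. */
static inline uint64_t
sketch_sec_to_nsec(uint64_t secs)
{
	if (secs > UINT64_MAX / 1000000000ULL)
		return UINT64_MAX;
	return secs * 1000000000ULL;
}

/* Returns 1 if a timeout should be armed for this value, else 0. */
static inline int
sketch_wants_timeout(uint64_t nsecs)
{
	return nsecs != SKETCH_INFSLP;
}
```

In this sketch a conversion that would overflow saturates to the same
value that means "no timeout", so an absurdly long sleep simply becomes
an indefinite one. Zero is an ordinary value here and still arms a
timeout.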


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seem to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
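
The type issue described above can be sketched in userland C as
follows. The helper name msec_to_ticks is hypothetical and the inputs
are assumed to be non-negative; this is not the code that was committed.

```c
#include <stdint.h>

/* Hypothetical helper: milliseconds to ticks at a given hz. */
static uint64_t
msec_to_ticks(int msec, int hz)
{
	/*
	 * Convert the operands first so the multiplication is done in
	 * uint64_t. Writing (uint64_t)(msec * hz) would multiply in
	 * int, and signed overflow is undefined behaviour, so the
	 * damage is done before the conversion happens.
	 */
	return ((uint64_t)msec * (uint64_t)hz) / 1000;
}
```

The point of the sketch is the order of operations: the conversion to
an unsigned 64-bit type has to come before the arithmetic, not after it.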


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on an other thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
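
The "atomic inc/dec" core of such a wrapper can be sketched with C11
atomics in userland. The sketch_ names are hypothetical and this is not
the kernel's refcnt implementation; in particular, the part that sleeps
until all references are released is left out.

```c
#include <stdatomic.h>

struct sketch_refcnt {
	atomic_uint refs;
};

static void
sketch_refcnt_init(struct sketch_refcnt *r)
{
	atomic_init(&r->refs, 1);	/* the creator holds one reference */
}

static void
sketch_refcnt_take(struct sketch_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/* Drop a reference; returns 1 if it was the last one, else 0. */
static int
sketch_refcnt_rele(struct sketch_refcnt *r)
{
	/* atomic_fetch_sub returns the value before the decrement. */
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

A caller that sees sketch_refcnt_rele() return 1 knows it dropped the
final reference and may free the object or wake a waiter.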


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
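
As a rough illustration of the rounding described in 1.102, the conversion
might look like the sketch below. The function name, the nanosecond units and
the tick_nsec parameter are assumptions made for this example; it is not the
code from kern_synch.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an OpenBSD function): convert a relative
 * timeout in nanoseconds into clock ticks, where tick_nsec is the
 * assumed length of one tick in nanoseconds.
 */
uint64_t
sketch_nsec_to_ticks(uint64_t nsecs, uint64_t tick_nsec)
{
	assert(tick_nsec > 0);

	/* Round up to whole ticks so the sleep is never too short. */
	uint64_t ticks = nsecs / tick_nsec + (nsecs % tick_nsec != 0);

	/* Add one for the tick that is already partly elapsed. */
	return ticks + 1;
}
```

With a 10 ms tick, a 10 ms timeout becomes 2 ticks rather than 1, because
the current tick may be almost over when the sleep starts.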


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes.
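
The idea behind wakeup_n can be sketched in userland C as below. This is only
an illustration under assumptions: the sketch_ names, the struct layout and
the use of a TAILQ are invented for the example and do not reproduce the
kernel's sleep queue or its locking.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a sleeping process. */
struct sketch_sleeper {
	TAILQ_ENTRY(sketch_sleeper) link;
	const volatile void *wchan;	/* channel slept on; NULL once woken */
};
TAILQ_HEAD(sketch_sleepq, sketch_sleeper);

/* Wake at most n sleepers waiting on chan; return how many were woken. */
int
sketch_wakeup_n(struct sketch_sleepq *q, const volatile void *chan, int n)
{
	struct sketch_sleeper *s, *next;
	int woken = 0;

	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		/* Fetch the successor first: s may be unlinked below. */
		next = TAILQ_NEXT(s, link);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, link);
		s->wchan = NULL;
		woken++;
	}
	return woken;
}

/* wakeup_one expressed as the n == 1 case. */
int
sketch_wakeup_one(struct sketch_sleepq *q, const volatile void *chan)
{
	return sketch_wakeup_n(q, chan, 1);
}

/* Usage: three sleepers, two of them on chan_a; wake up to n of those. */
int
sketch_demo(int n)
{
	static int chan_a, chan_b;
	struct sketch_sleepq q;
	struct sketch_sleeper s[3] = {
		{ .wchan = &chan_a }, { .wchan = &chan_b }, { .wchan = &chan_a }
	};

	TAILQ_INIT(&q);
	for (int i = 0; i < 3; i++)
		TAILQ_INSERT_TAIL(&q, &s[i], link);
	return sketch_wakeup_n(&q, &chan_a, n);
}
```

Stopping once n sleepers have been woken avoids walking the rest of the
queue when the caller knows only a few waiters need to run.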


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
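
The clipping rule from the "nice bug" paragraph can be made concrete with a
small sketch. The nice weight of two and the limit NICE_WEIGHT * PRIO_MAX - PPQ
come from the message above; the values chosen here for PUSER, MAXPRI and PPQ,
the restriction to nice values 0..20, and all sketch_ names are assumptions
for illustration only.

```c
#include <assert.h>

enum {
	SKETCH_PUSER = 50,		/* assumed base user priority */
	SKETCH_MAXPRI = 127,		/* assumed worst priority */
	SKETCH_PPQ = 4,			/* assumed priorities per run queue */
	SKETCH_NICE_WEIGHT = 2,		/* "current nice weight of two" */
	SKETCH_PRIO_MAX = 20,		/* nice +20 */
	SKETCH_ESTCPU_LIMIT =
	    SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ
};

/* Clip the CPU-usage estimate to the limit named in the commit message. */
int
sketch_clamp_estcpu(int estcpu)
{
	return estcpu > SKETCH_ESTCPU_LIMIT ? SKETCH_ESTCPU_LIMIT : estcpu;
}

/* User priority from clipped usage plus the nice penalty (higher is worse). */
int
sketch_usrpri(int estcpu, int nice)
{
	assert(nice >= 0 && nice <= SKETCH_PRIO_MAX);

	int pri = SKETCH_PUSER + sketch_clamp_estcpu(estcpu) +
	    SKETCH_NICE_WEIGHT * nice;
	return pri > SKETCH_MAXPRI ? SKETCH_MAXPRI : pri;
}

/* Run queue index for a priority. */
int
sketch_queue(int pri)
{
	return pri / SKETCH_PPQ;
}
```

Under these assumed constants, a compute-bound process at nice 0 tops out one
queue better than an idle nice +20 process, which is the separation the
clipping is meant to guarantee.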


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.149 18-Jun-2019 visa

Ensure that timeout p_sleep_to is not left running when finishing sleep.
This is necessary when invoking sleep_finish_timeout() without the
kernel lock. If not cancelled properly, an already running endtsleep()
might cause a spurious wakeup on the thread if the thread re-enters
a sleep queue very quickly before the handler completes.

The flag P_TIMEOUT should stay cleared across the timeout cancellation.
Add an assertion for that.

OK mpi@


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
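
A minimal sketch of the point made in 1.133, assuming a timeout given as
seconds plus nanoseconds and a tick rate hz: the operands are converted to
uint64_t before any arithmetic, and the result is saturated instead of being
allowed to wrap. The function name and the saturation at INT_MAX are
assumptions for this example, not the kernel's code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: timeout (sec, nsec) to ticks at hz, saturating. */
int
sketch_timeout_ticks(long long sec, long nsec, int hz)
{
	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
	assert(hz > 0 && hz <= 1000);

	/* Too many seconds to represent in an int tick count: saturate. */
	if ((uint64_t)sec >= (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* Start from uint64_t so the multiply is never done in a signed type. */
	uint64_t nsec_per_tick = 1000000000ULL / (uint64_t)hz;
	uint64_t ticks = (uint64_t)sec * (uint64_t)hz +
	    ((uint64_t)nsec + nsec_per_tick - 1) / nsec_per_tick;

	return ticks > (uint64_t)INT_MAX ? INT_MAX : (int)ticks;
}
```

Casting only the final result would not help, since a signed overflow in the
intermediate multiplication is already undefined behaviour by then.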


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
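
A rough sketch of the "atomic inc/dec" half of such a wrapper, written with C11 atomics so it can be compiled in userland. The toy_ names are hypothetical and this is not the kernel's refcnt implementation; in particular, the sleeping finalise step described above is left out.

```c
#include <stdatomic.h>

/* Toy reference counter; an illustration only, not <sys/refcnt.h>. */
struct toy_refcnt {
	atomic_uint refs;
};

/* A new object starts with one reference, held by its creator. */
static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
toy_refcnt_take(struct toy_refcnt *r)
{
	atomic_fetch_add(&r->refs, 1);
}

/*
 * Release a reference.  Returns 1 if the caller dropped the last
 * reference (and so may free the object), 0 otherwise.
 */
static int
toy_refcnt_rele(struct toy_refcnt *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```

Writing the release logic once means every user gets the same "who frees the object" rule instead of reimplementing it.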


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@
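
The pattern can be sketched with a toy, single-threaded model. The toy_ names below are hypothetical stand-ins, not the kernel's timeout(9) interface; the sketch only shows why acting on the return value of the delete call is preferable to a separate pending check followed by a delete.

```c
#include <stdbool.h>

/* Toy stand-in for a kernel timeout; not the real struct timeout. */
struct toy_timeout {
	bool pending;
};

static void
toy_timeout_add(struct toy_timeout *to)
{
	to->pending = true;
}

/*
 * Cancel the timeout.  Returns 1 if a pending timeout was removed,
 * 0 if there was nothing to remove.  In a real kernel the check and
 * the removal would happen atomically under a lock.
 */
static int
toy_timeout_del(struct toy_timeout *to)
{
	if (!to->pending)
		return 0;
	to->pending = false;
	return 1;
}

/*
 * Drop the reference owned by the timeout only if we really
 * cancelled it.  There is no separate "is it pending?" query whose
 * answer could be stale by the time the delete runs.
 */
static int
toy_cancel(struct toy_timeout *to, int *refs)
{
	if (toy_timeout_del(to)) {
		(*refs)--;
		return 1;
	}
	return 0;
}
```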


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
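
The "up to n" behaviour can be illustrated with a toy sleep queue. Everything below is a hypothetical userland model (an array instead of the kernel's queues, toy_ names throughout), not the code added by this commit.

```c
#include <stddef.h>

/* Toy sleeper: wchan is the address it sleeps on, NULL if awake. */
struct toy_sleeper {
	const volatile void *wchan;
};

/*
 * Wake at most n sleepers waiting on ident, in queue order.
 * Returns the number of sleepers actually woken.
 */
static int
toy_wakeup_n(struct toy_sleeper *q, size_t len,
    const volatile void *ident, int n)
{
	int woken = 0;

	if (ident == NULL)
		return 0;
	for (size_t i = 0; i < len && woken < n; i++) {
		if (q[i].wchan == ident) {
			q[i].wchan = NULL;	/* mark as awake */
			woken++;
		}
	}
	return woken;
}

/* Waking a single sleeper is just the n == 1 case. */
static int
toy_wakeup_one(struct toy_sleeper *q, size_t len,
    const volatile void *ident)
{
	return toy_wakeup_n(q, len, ident, 1);
}
```

Stopping once n sleepers have been woken also avoids walking the rest of the queue when the caller knows only one waiter can make progress.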


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
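
The clipping rule from the "nice bug" paragraph can be sketched as follows. The nice weight of two comes from the message itself; the values of PRIO_MAX (20) and PPQ (4) are assumptions made for this illustration, and the toy_ helpers are hypothetical rather than the scheduler's actual code.

```c
/* Constants for illustration only; the last two values are assumed. */
#define TOY_NICE_WEIGHT	2	/* "nice weight of two" per the message */
#define TOY_PRIO_MAX	20	/* assumed maximum nice value */
#define TOY_PPQ		4	/* assumed priorities per run queue */

/*
 * Clip the decayed CPU-usage estimate so that a nice +0 process can
 * never penalize itself onto the queue used by nice +20 processes.
 */
static unsigned int
toy_estcpu_limit(unsigned int estcpu)
{
	unsigned int max = TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ;

	return estcpu < max ? estcpu : max;
}

/* Priority penalty contributed directly by the nice value. */
static unsigned int
toy_nice_penalty(unsigned int nice)
{
	return TOY_NICE_WEIGHT * nice;
}
```

With these assumed constants the limit is 36, one queue (PPQ) below the penalty of 40 that a nice +20 process receives, which is the separation the message asks for.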


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.148 23-Apr-2019 visa

Remove file name and line number output from witness(4)

Reduce code clutter by removing the file name and line number output
from witness(4). Typically it is easy enough to locate offending locks
using the stack traces that are shown in lock order conflict reports.
Tricky cases can be tracked using sysctl kern.witness.locktrace=1 .

This patch additionally removes the witness(4) wrapper for mutexes.
Now each mutex implementation has to invoke the WITNESS_*() macros
in order to utilize the checker.

Discussed with and OK dlg@, OK mpi@


Revision tags: OPENBSD_6_5_BASE
# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack; checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.147 23-Jan-2019 cheloha

Sprinkle a pinch of timerisvalid/timespecisvalid over the rest of sys/kern


Revision tags: OPENBSD_6_4_BASE
# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack: checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
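
The overflow concern above can be illustrated with a small user-space sketch.
The helper name timespec_to_ticks(), the hz parameter and the clamp to INT_MAX
are assumptions made for illustration, not the code that was committed. The
point is only that the arithmetic starts out in uint64_t instead of being
assigned to an unsigned type after a signed multiplication has already
overflowed.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: convert a relative timespec to clock ticks.
 * The operands are widened to uint64_t before multiplying, and the
 * seconds part is range-checked first so the product cannot wrap.
 */
static int
timespec_to_ticks(const struct timespec *ts, int hz)
{
	uint64_t ticks;

	assert(hz > 0 && ts->tv_sec >= 0);
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);

	/* Anything that would exceed INT_MAX ticks is clamped. */
	if ((uint64_t)ts->tv_sec > (uint64_t)INT_MAX / hz)
		return INT_MAX;

	ticks = (uint64_t)ts->tv_sec * hz +
	    ((uint64_t)ts->tv_nsec * hz) / 1000000000ULL;
	if (ticks > INT_MAX)
		ticks = INT_MAX;
	return (int)ticks;
}
```

With hz assumed to be 100, a 1.5 second timeout becomes 150 ticks, while an
absurdly large timeout is clamped instead of wrapping to a negative value.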


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
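
A user-space sketch of the "atomic inc/dec" part of such a wrapper, using C11
atomics. All names ending in _sketch are hypothetical and this is not the
kernel refcnt API; in particular the sleeping done by refcnt_finalise is left
out, and the release function merely reports whether the last reference went
away.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct refcnt_sketch {
	atomic_uint refs;
};

/* The creator holds the first reference. */
static void
refcnt_sketch_init(struct refcnt_sketch *r)
{
	atomic_init(&r->refs, 1);
}

/* Taking a reference is only legal while one is already held. */
static void
refcnt_sketch_take(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);
	(void)old;
}

/* Returns true when the caller dropped the last reference. */
static bool
refcnt_sketch_rele(struct refcnt_sketch *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow check */
	return old == 1;
}
```

A caller would free the object only when refcnt_sketch_rele() returns true.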


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@
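
The lost-wakeup window described above has a direct user-space analogue,
sketched here with POSIX threads rather than the kernel sleep API. The struct
and function names are made up for illustration. Because the predicate is
tested and the wait is entered under the same lock the signaller takes, a
signal cannot slip in between the check and the sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct event {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	bool		ready;	/* the condition being waited on */
};

static void
event_init(struct event *ev)
{
	int error;

	error = pthread_mutex_init(&ev->lock, NULL);
	assert(error == 0);
	error = pthread_cond_init(&ev->cv, NULL);
	assert(error == 0);
	(void)error;
	ev->ready = false;
}

/* Check the condition and go to sleep without releasing the lock between. */
static void
event_wait(struct event *ev)
{
	pthread_mutex_lock(&ev->lock);
	while (!ev->ready)
		pthread_cond_wait(&ev->cv, &ev->lock);
	pthread_mutex_unlock(&ev->lock);
}

/* Change the condition and wake waiters while holding the same lock. */
static void
event_signal(struct event *ev)
{
	pthread_mutex_lock(&ev->lock);
	ev->ready = true;
	pthread_cond_broadcast(&ev->cv);
	pthread_mutex_unlock(&ev->lock);
}
```

If event_signal() runs before event_wait(), the waiter sees ready already set
and returns without sleeping, so the wakeup is not lost.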


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
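
A sketch of the rounding rule described above. The function name and the
nanosecond units are assumptions. The extra tick is there because the current
tick is already partly elapsed, so a sleep of N ticks counted from now could
otherwise end up to one tick early.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a duration in nanoseconds to a tick
 * count, rounding up and adding one for the tick already in progress.
 */
static uint64_t
nsec_to_sleep_ticks(uint64_t nsec, uint64_t tick_nsec)
{
	uint64_t ticks;

	assert(tick_nsec > 0);

	ticks = nsec / tick_nsec;
	if (nsec % tick_nsec != 0)
		ticks++;		/* round up a partial tick */
	return ticks + 1;		/* account for the current tick */
}
```

With a 10 ms tick, a 25 ms request is rounded up to 3 ticks and then becomes 4.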


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
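
To illustrate the TAILQ-based sleep queue mentioned in the list above, here is
a user-space sketch built on the <sys/queue.h> macros. The struct layout and
the name wakeup_n_sketch() are hypothetical and much simpler than the kernel
code. It shows a wakeup as a single pass over the queue, with each matching
entry unlinked in constant time.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sleeper {
	TAILQ_ENTRY(sleeper)	 entry;
	const void		*wchan;	/* address being slept on */
	int			 awake;
};
TAILQ_HEAD(sleepq, sleeper);

/* Wake at most n sleepers waiting on chan; returns how many were woken. */
static int
wakeup_n_sketch(struct sleepq *q, const void *chan, int n)
{
	struct sleeper *s, *next;
	int woken = 0;

	assert(n >= 0);
	for (s = TAILQ_FIRST(q); s != NULL && woken < n; s = next) {
		/* Fetch the successor first: s may be unlinked below. */
		next = TAILQ_NEXT(s, entry);
		if (s->wchan != chan)
			continue;
		TAILQ_REMOVE(q, s, entry);
		s->wchan = NULL;
		s->awake = 1;
		woken++;
	}
	return woken;
}
```

Passing n = 1 gives wakeup_one-style behaviour, and a large n wakes every
sleeper on the channel.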


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
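
The clipping rule from the "nice bug" section can be sketched as follows. This
is a minimal illustration, not the committed code: the sk_ names are
hypothetical, and the constant values other than the nice weight of two, as
well as the exact priority formula, are assumptions chosen to match the
description above. Lower priority values run first, as usual in BSD.

```c
#include <assert.h>

/* Illustrative constants; values are assumptions, not taken from the log. */
#define SK_PUSER        50   /* base user priority */
#define SK_MAXPRI       127  /* weakest priority value */
#define SK_PPQ          4    /* priority values per run queue */
#define SK_PRIO_MAX     20   /* largest nice value */
#define SK_NICE_WEIGHT  2    /* priority points per nice step */

/* Upper bound on the CPU estimate: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define SK_ESTCPU_LIMIT (SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the decayed CPU estimate so it cannot outweigh nice +20. */
unsigned int
sk_estcpu_clip(unsigned int estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}

/* Priority from the clipped estimate plus the weighted nice value. */
int
sk_user_priority(unsigned int estcpu, int nice)
{
	int pri = SK_PUSER + (int)sk_estcpu_clip(estcpu) + SK_NICE_WEIGHT * nice;

	return pri > SK_MAXPRI ? SK_MAXPRI : pri;
}

/* Run queue index for a priority value. */
int
sk_run_queue(int pri)
{
	return pri / SK_PPQ;
}
```

With these assumed values, a compute-bound nice +0 process tops out one run
queue ahead of an idle nice +20 process, which is the separation the limit is
meant to guarantee.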


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.146 31-May-2018 guenther

Add sleep_finish_all(), which provides the common combo of sleep_finish(),
sleep_finish_timeout(), and sleep_finish_signal() with error preferencing,
and then use it in five places.

ok mpi@


# 1.145 28-May-2018 cheloha

rwsleep: generalize to support both read- and write-locks.

Wanted for tentative clock_nanosleep(2) diff, but maybe useful
elsewhere in the future.

ok mpi@


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@
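
The return-value rule described in this commit can be sketched as a small
helper. This is an illustration only: sk_sleep_result() and its two boolean
parameters are hypothetical and do not appear in the kernel source.

```c
#include <assert.h>
#include <errno.h>

/*
 * Pick the error reported for a sleep.
 * interrupted: non-zero if a signal ended the sleep.
 * sa_restart:  non-zero if that signal's handler was installed with SA_RESTART.
 */
int
sk_sleep_result(int interrupted, int sa_restart)
{
	if (!interrupted)
		return 0;
	/* ECANCELED tells the caller the syscall may be restarted. */
	return sa_restart ? ECANCELED : EINTR;
}
```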


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourself on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack, checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
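
A minimal sketch of the idea, assuming a relative timeout given as seconds and
nanoseconds: do the arithmetic in uint64_t from the start, then clamp.
sk_timeout_ticks() is a hypothetical name, and the early clamp on the seconds
value is an extra safeguard added for this example rather than something the
commit message describes.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a relative timeout to clock ticks, clamped to INT_MAX.
 * hz is the tick rate in ticks per second.
 */
int
sk_timeout_ticks(int64_t sec, long nsec, int hz)
{
	uint64_t nsec_per_tick, ticks;

	assert(hz > 0 && hz <= 1000000000);
	assert(sec >= 0 && nsec >= 0 && nsec < 1000000000L);
	nsec_per_tick = UINT64_C(1000000000) / (uint64_t)hz;

	/* Clamp early so the 64-bit product below cannot wrap. */
	if ((uint64_t)sec > (uint64_t)INT_MAX / (uint64_t)hz)
		return INT_MAX;

	/* Operands are widened before multiplying, not after. */
	ticks = (uint64_t)hz * (uint64_t)sec + (uint64_t)nsec / nsec_per_tick;
	return ticks > INT_MAX ? INT_MAX : (int)ticks;
}
```

Assigning an int product to a wider variable would not help, because the
overflow happens in the multiplication itself, before the assignment.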


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and I, to use atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
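
The counting half of such a wrapper can be sketched in userland with C11
atomics. This is an assumption-laden illustration: the sk_ names are
hypothetical, the asserts stand in for kernel diagnostics, and the sleeping
done by refcnt_finalise is left out because it depends on the kernel sleep API.

```c
#include <assert.h>
#include <stdatomic.h>

struct sk_refcnt {
	atomic_uint refs;
};

/* The creator holds the first reference. */
void
sk_refcnt_init(struct sk_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference on a live object. */
void
sk_refcnt_take(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	assert(old != 0);	/* object must still be referenced */
	(void)old;
}

/* Drop one reference; returns 1 if it was the last one. */
int
sk_refcnt_rele(struct sk_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* catch underflow */
	return old == 1;
}
```

The caller that sees sk_refcnt_rele() return 1 is the one responsible for
freeing the object.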


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
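
The rounding rule can be sketched like this. It is a hypothetical helper, not
the committed code: sk_ticks_round_up() and its nanosecond parameters are
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a duration in nanoseconds to ticks: round up to a whole tick,
 * then add one more for the tick that is already partly elapsed.
 */
uint64_t
sk_ticks_round_up(uint64_t nsec, uint64_t nsec_per_tick)
{
	uint64_t ticks;

	assert(nsec_per_tick > 0);
	/* Ceiling division written so that the addition cannot overflow. */
	ticks = nsec / nsec_per_tick + (nsec % nsec_per_tick != 0);
	return ticks + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could end
almost immediately, well short of the requested duration.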


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that makes sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only need a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now get
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting non-monotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wakeup up to n sleeping processes
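
The bounded wakeup can be sketched over a plain array of sleepers. This is a
simplified model, not the kernel's sleep queue: the sk_ types and functions are
hypothetical, and wakeup_one is modelled here as the n == 1 case.

```c
#include <assert.h>
#include <stddef.h>

struct sk_sleeper {
	const void *chan;	/* wait channel, NULL when not sleeping */
};

/* Wake at most n sleepers waiting on chan; returns how many were woken. */
int
sk_wakeup_n(struct sk_sleeper *q, size_t len, const void *chan, int n)
{
	int woken = 0;

	assert(chan != NULL);
	for (size_t i = 0; i < len && woken < n; i++) {
		if (q[i].chan == chan) {
			q[i].chan = NULL;	/* no longer sleeping */
			woken++;
		}
	}
	return woken;
}

/* Wake a single sleeper on chan, if there is one. */
int
sk_wakeup_one(struct sk_sleeper *q, size_t len, const void *chan)
{
	return sk_wakeup_n(q, len, chan, 1);
}
```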


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.144 24-Apr-2018 pirofti

Validate timespec and return ECANCELED when interrupted with SA_RESTART.

Discussing with mpi@ and guenther@, we decided to first fix the existing
semaphore implementation with regards to SA_RESTART and POSIX compliant
returns in the case where we deal with restartable signals.

Currently we return EINTR everywhere which is mostly incorrect as the
user can not know if she needs to recall the syscall or not. Return
ECANCELED to signal that SA_RESTART was set and EINTR otherwise.

Regression tests pass and so does the posixsuite. Timespec validation
bits are needed to pass the latter.

OK mpi@, guenther@


Revision tags: OPENBSD_6_3_BASE
# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack; checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable(); we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@ problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up tuesday.
Actually found this after inquiring guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is a more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_sec or mono_time.tv_sec now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes.


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/priciple/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
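
The clipping rule from the "nice bug" paragraph can be illustrated with a
small userland sketch. This is not the kernel code: the toy_* names are
hypothetical, NICE_WEIGHT (2) and PRIO_MAX (20) are taken from the message
above, PPQ = 4 priorities per run queue is an assumption, priorities are
relative to the base user priority, and negative nice values are left out.

```c
#include <assert.h>

/*
 * Illustrative constants. NICE_WEIGHT (2) and PRIO_MAX (20) follow the
 * commit message; PPQ = 4 priorities per run queue is an assumption.
 */
enum { TOY_NICE_WEIGHT = 2, TOY_PRIO_MAX = 20, TOY_PPQ = 4 };

/* Clip the CPU-usage estimate to NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned int
toy_estcpu_clip(unsigned int estcpu)
{
	const unsigned int lim = TOY_NICE_WEIGHT * TOY_PRIO_MAX - TOY_PPQ;

	return estcpu < lim ? estcpu : lim;
}

/* User priority relative to the base user priority; larger is worse. */
static unsigned int
toy_usrpri(unsigned int estcpu, int nice)
{
	assert(nice >= 0 && nice <= TOY_PRIO_MAX);
	return toy_estcpu_clip(estcpu) + TOY_NICE_WEIGHT * (unsigned int)nice;
}

/* Run queue index for a given priority. */
static unsigned int
toy_queue(unsigned int pri)
{
	return pri / TOY_PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out one queue
before the queue where an idle nice +20 process starts, so it cannot
penalize itself onto the same queue.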


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.143 14-Dec-2017 dlg

add code to provide simple wait condition handling.

this will be used to replace the bare sleep_state handling in a
bunch of places, starting with the barriers.


# 1.142 04-Dec-2017 mpi

Use _kernel_lock_held() instead of __mp_lock_held(&kernel_lock).

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.141 18-May-2017 mpi

Do not panic if we find ourselves on the sleep queue while being SONPROC.

If the rwlock passed to rwsleep(9) is contended, the CPU will call wakeup()
between sleep_setup() and sleep_finish(). At this moment curproc is on the
sleep queue but marked as SONPROC. Avoid panicking in this case.

Problem reported by sthen@

ok kettenis@, visa@


# 1.140 20-Apr-2017 visa

Hook up mutex(9) to witness(4).


# 1.139 20-Apr-2017 visa

Hook up rwlock(9) to witness(4).

Loosely based on a diff from Christian Ludwig


Revision tags: OPENBSD_6_1_BASE
# 1.138 31-Jan-2017 mpi

Remove the inifioctl hack: checking for an unheld NET_LOCK() in
tsleep(9) & friends seems to only produce false positives and cannot
be easily disabled.


# 1.137 25-Jan-2017 mpi

Introduce a hack to remove false-positives when looking for memory
allocation that can sleep while holding the NET_LOCK().

To be removed once we're confident the remaining code paths are safe.

Discussed with deraadt@


# 1.136 21-Jan-2017 guenther

p_comm is the process's command and isn't per thread, so move it from
struct proc to struct process.

ok deraadt@ kettenis@


# 1.135 13-Sep-2016 mpi

Introduce rwsleep(9), an equivalent to msleep(9) but for code protected
by a write lock.

ok guenther@, vgross@


# 1.134 03-Sep-2016 akfaew

Remove ticket lock support from thrsleep. It's unused.

OK guenther@ mpi@ tedu@


Revision tags: OPENBSD_6_0_BASE
# 1.133 06-Jul-2016 tedu

fix several places where calculating ticks could overflow.
it's not enough to assign to an unsigned type because if the arithmetic
overflows the compiler may decide to do anything. so change all the
long long casts to uint64_t so that we start with the right type.
reported by Tim Newsham of NCC.
ok deraadt
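
The point about starting with the right type can be sketched in userland C.
The helper below is hypothetical (the name and the 32-bit argument types are
assumptions, not the kernel's interface); it only shows the operands being
widened to uint64_t before the multiplication, rather than the result being
assigned to an unsigned type afterwards.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert seconds plus nanoseconds to clock ticks
 * at `hz` ticks per second, truncating any partial tick.
 */
static uint64_t
toy_timespec_to_ticks(uint32_t sec, uint32_t nsec, uint32_t hz)
{
	assert(nsec < 1000000000U);

	/*
	 * Widen each operand first so both products are computed in
	 * 64 bits. With 32-bit inputs the sum cannot exceed UINT64_MAX.
	 */
	return (uint64_t)sec * hz + ((uint64_t)nsec * hz) / 1000000000ULL;
}
```

For example, 3000000 seconds at hz = 1000 yields 3000000000 ticks, which
would not fit in a signed 32-bit int.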


# 1.132 04-Jul-2016 tedu

switch calculated thrsleep timeout to unsigned to prevent overflow
into negative values, which later causes a panic.
reported by Tim Newsham at NCC.
ok guenther


# 1.131 29-Mar-2016 jsg

add back $OpenBSD$


# 1.130 28-Mar-2016 kettenis

Make sure that a thread that calls sched_yield(2) ends up on the run queue
behind all other threads in the process by temporarily lowering its priority.
This isn't optimal but it is the easiest way to guarantee that we make
progress when we're waiting on another thread to release a lock. This
results in significant improvements for processes that suffer from lock
contention, most notably firefox. Unfortunately this means that sched_yield(2)
needs to grab the kernel lock again.

All the hard work was done by mpi@, based on observations of the behaviour
of the BFS scheduler diff by Michal Mazurek.

ok deraadt@


# 1.129 09-Mar-2016 mpi

Correct some comments and definitions, from Michal Mazurek.


Revision tags: OPENBSD_5_9_BASE
# 1.128 01-Feb-2016 dlg

branches: 1.128.2;
add a DIAGNOSTIC for refcnt_take overflow.

ok mpi@


# 1.127 15-Jan-2016 dlg

KASSERT on refcnt underflow.

ok mpi@ bluhm@


# 1.126 23-Nov-2015 mpi

Do not include <sys/atomic.h> inside <sys/refcnt.h>.

Prevent lazy developers, like David and me, from using atomic operations
without including <sys/atomic.h>.

ok dlg@


# 1.125 28-Sep-2015 deraadt

satisfy RAMDISK by placing cold == 2 case inside #ifdef DDB


# 1.124 28-Sep-2015 deraadt

In low-level suspend routines, set cold=2. In tsleep(), use this to
spit out a ddb trace to console. This should allow us to find suspend
or resume routines which break the rules. It depends on the console
output function being non-sleeping.... but that's another codepath which
should try to be safe when cold is set.
ok kettenis


# 1.123 11-Sep-2015 dlg

introduce a wrapper around reference counts called refcnt.

it's basically atomic inc/dec, but it includes magical sleep code
in refcnt_finalise that is better written once than many times.
refcnt_finalise sleeps until all references are released and does
so with sleep_setup and sleep_finalize, which is fairly subtle.

putting this in now so we can get on with work in the stack, a
proper discussion about visibility and how available intrinsics
should be in the kernel can happen after next week.

with help from guenther@
ok guenther@ deraadt@ mpi@
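
The "atomic inc/dec" half of this can be sketched with C11 atomics in
userland. The toy_refcnt names are hypothetical and this is not the kernel
implementation; in particular the sleeping done by refcnt_finalise is left
out, and the asserts only stand in for the kind of underflow/overflow
diagnostics a real wrapper might carry.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Toy reference counter; the creator holds the first reference. */
struct toy_refcnt {
	atomic_uint refs;
};

static void
toy_refcnt_init(struct toy_refcnt *r)
{
	atomic_init(&r->refs, 1);
}

/* Take an additional reference. */
static void
toy_refcnt_take(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_add(&r->refs, 1);

	/* Taking from zero or wrapping around would be a bug. */
	assert(old != 0 && old != UINT_MAX);
}

/* Drop a reference; returns 1 if this was the last one, else 0. */
static int
toy_refcnt_rele(struct toy_refcnt *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old != 0);	/* underflow: released more than taken */
	return old == 1;
}
```

A caller would free the object only when toy_refcnt_rele() returns 1.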


# 1.122 07-Sep-2015 guenther

Delete ktracing of context switches: it's unused, and not particularly useful,
and doing VOP_WRITE() from inside tsleep/msleep makes the locking too
complicated, making it harder to move forward on MP changes.

ok deraadt@ kettenis@


Revision tags: OPENBSD_5_8_BASE
# 1.121 12-May-2015 mikeb

branches: 1.121.4;
Drop and reacquire the kernel lock in the vfs_shutdown and "cold"
portions of msleep and tsleep to give interrupts a chance to run
on other CPUs.

Tweak and OK kettenis


# 1.120 07-May-2015 mikeb

msleep(9) must prevent kernel from attempting a context switch
during autoconf and after panics.

Tweak and OK guenther, OK miod


# 1.119 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.118 10-Feb-2015 blambert

assert that we hold the scheduler lock in unsleep()

ok guenther@


# 1.117 09-Feb-2015 dlg

we want to defer work traditionally (in openbsd) handled in an
interrupt context to a taskq running in a thread. however, there
is a concern that if we do that then we allow accidental use of
sleeping APIs in this work, which will make it harder to move the
work back to interrupts in the future.

guenther and kettenis came up with the idea of marking a proc with
CANTSLEEP which the sleep paths can check and panic on.

this builds on that so you create taskqs that run with CANTSLEEP
set except when they need to sleep for more tasks to run.

the taskq_create api is changed to take a flags argument so users
can specify CANTSLEEP. MPSAFE is also passed via this flags field
now. this means archs that defined IPL_MPSAFE to 0 can now create
mpsafe taskqs too.

lots of discussion at s2k15
ok guenther@ miod@ mpi@ tedu@ pelikan@


Revision tags: OPENBSD_5_6_BASE
# 1.116 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.115 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.114 23-Jan-2014 guenther

msleep() with a NULL mtx argument is a programming error.

ok matthew@ phessler@ dlg@


# 1.113 23-Jan-2014 guenther

Waiting on a condition without a lock is an error: you need *some* lock
to guarantee there isn't a window in which you can lose a wakeup. The
exception for tsleep() is when it's just being used to sleep for a period
of time, so permit that too.

ok jsing@ deraadt@


# 1.112 24-Dec-2013 dlg

get rid of if (timeout_pending()) timeout_del(). this is racy. any
conditionals you did on timeout_pending can now be done on timeout_del
now that it returns what it did.

ok and a very good fix from kettenis@


# 1.111 25-Nov-2013 tedu

rename magicnumber to globalsleepaddr


# 1.110 18-Nov-2013 tedu

hack in a global rendezvous for interprocess semaphores to use


# 1.109 09-Nov-2013 guenther

Add KASSERT()s to tsleep() and msleep() to verify that bogus flags
aren't being passed to them. Fix UVM_WAIT() to not pass PNORELOCK to
tsleep(), as that flag only does something with msleep().

ok beck@ dlg@


# 1.108 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.107 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.106 01-Jun-2013 tedu

cleanup and consolidate the spinlock_lock (what a name!) code.
it's now atomic_lock to better reflect its usage, and librthread now
features a new spinlock that's really a ticket lock.
thrsleep can handle both types of lock via a flag in the clock arg.
(temp back compat hack)
remove some old stuff that's accumulated along the way and no longer used.
some feedback from dlg, who is concerned with all things ticket lock.
(you need to boot a new kernel before installing librthread)


# 1.105 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


Revision tags: OPENBSD_5_3_BASE
# 1.104 21-Aug-2012 haesbaert

Stop "inlining" setrunnable() we already had two bugs because of it.
This also makes sure we call cpu_unidle() on the correct cpu, since the
inlining order was wrong and could call it on the old cpu.

ok kettenis@


Revision tags: OPENBSD_5_2_BASE
# 1.103 10-Jul-2012 haesbaert

We should only call need_resched() if the priority is lower than the
priority of the current running process.

In amd64 a call to need_resched() sends an IPI to the other cpu.

This fixes aja@'s problem where he would move the mouse and see 60000
IPIs being sent.

Thanks to mikeb@ for bringing that subject up Tuesday.
Actually found this after asking guenther@ about some changes in
mi_switch().

ok guenther@ aja@


# 1.102 10-Apr-2012 guenther

When converting the timeout to ticks, both round up and add one to account
for the tick that we're already in the middle of.

noted and tested by aja; ok kurt@
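
A minimal userland sketch of the rounding described here, assuming a
relative timeout in nanoseconds and a tick rate `hz` that is at most
1000000000. The function name is hypothetical and this is not the kernel's
conversion code; it also does not saturate, so the final addition can wrap
for extreme inputs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a relative timeout in nanoseconds to
 * clock ticks at `hz` ticks per second.
 */
static uint64_t
toy_nsec_to_ticks(uint64_t nsec, uint64_t hz)
{
	const uint64_t nsec_per_sec = 1000000000ULL;
	uint64_t tick_nsec, ticks;

	assert(hz > 0 && hz <= nsec_per_sec);
	tick_nsec = nsec_per_sec / hz;	/* length of one tick */

	/* Round up so a partial tick still counts as a whole one. */
	ticks = nsec / tick_nsec + (nsec % tick_nsec != 0);

	/* Add one for the tick we are already in the middle of. */
	return ticks + 1;
}
```

Without the extra tick, a sleep requested late in the current tick could
expire almost a full tick early.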


# 1.101 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.100 19-Mar-2012 guenther

Add tracing and dumping of "pointer to struct" syscall arguments for
structs timespec, timeval, sigaction, and rlimit.

ok otto@ jsing@


Revision tags: OPENBSD_5_1_BASE
# 1.99 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.98 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.97 07-Jul-2011 guenther

Functions used in files other than where they are defined should be
declared in .h files, not in each .c. Apply that rule to endtsleep(),
scheduler_start(), updatepri(), and realitexpire()

ok deraadt@ tedu@


Revision tags: OPENBSD_4_9_BASE
# 1.96 25-Jan-2011 stsp

Don't ignore copyout() return value in sys_thrsleep().
Spotted by miod some time ago.
ok miod guenther


Revision tags: OPENBSD_4_8_BASE
# 1.95 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.94 10-Jun-2010 deraadt

Declare safepri at the MD level on each platform, so that the kern_synch.c
does not have to deal with it as a common. Some platforms may be missed
by this commit... if you spot one, fix it the same way.
ok miod


Revision tags: OPENBSD_4_7_BASE
# 1.93 27-Dec-2009 guenther

Correct previous commit: match the errno return by thrsleep() in
the already-timed-out case to be the same (EWOULDBLOCK) as when it
times out after sleeping


# 1.92 27-Nov-2009 guenther

Convert thrsleep() to an absolute timeout with clockid to eliminate a
race condition and prep for later support of pthread_condattr_setclock()

"get it in" deraadt@, tedu@, cheers by others


Revision tags: OPENBSD_4_6_BASE
# 1.91 04-Jun-2009 beck

unfuck msleep - fixed by art and ariane after much horror and teeth gnashing
over why the processes were being woken up at splvm after the page daemon
ran - and probably also had the page daemon running at splvm after the first
pass through the loop.
ok art@ weingart@ oga@ ariane@


# 1.90 02-Jun-2009 guenther

Change the wait-channel type to 'const volatile void *', eliminating
the need for casts when calling tsleep(), msleep(), and wakeup().

"I guess so" oga@ "it's masturbation" art@


# 1.89 14-Apr-2009 art

Some tweaks to the cpu affinity code.
- Split up choosing of cpu between fork and "normal" cases. Fork is
very different and should be treated as such.
- Instead of implicitly choosing a cpu in setrunqueue, do it outside
where it actually makes sense.
- Just because a cpu is marked as idle doesn't mean it will be soon.
There could be a thundering herd effect if we call wakeup from an
interrupt handler, so subtract cpus with queued processes when
deciding which cpu is actually idle.
- some simplifications allowed by the above.

kettenis@ ok (except one bugfix that was not in the initial diff)


# 1.88 23-Mar-2009 art

Processor affinity for processes.
- Split up run queues so that every cpu has one.
- Make setrunqueue choose the cpu where we want to make this process
runnable (this should be refined and less brutal in the future).
- When choosing the cpu where we want to run, make some kind of educated
guess where it will be best to run (very naive right now).
Other:
- Set operations for sets of cpus.
- load average calculations per cpu.
- sched_is_idle() -> curcpu_is_idle()

tested, debugged and prodded by many@


Revision tags: OPENBSD_4_5_BASE
# 1.87 10-Sep-2008 blambert

There's no need to fully traverse the wakeup queue when waking a specific
process sleeping on a unique address (wakeup -> wakeup_one)

ok guenther@, tedu@, art@


# 1.86 05-Sep-2008 oga

Back out previous. Art realised a problem with it.


# 1.85 05-Sep-2008 art

Don't overwrite the old ipl in msleep if PNORELOCK was set.


# 1.84 05-Sep-2008 oga

When munging the WANTIPL of the mutex to prevent undoing the sched_lock,
use the constant for IPL_SCHED, and not splsched(), which doesn't do what
we want.

ok art@. Tested by Paul de Weerd.


Revision tags: OPENBSD_4_3_BASE OPENBSD_4_4_BASE
# 1.83 30-Nov-2007 oga

Fix msleep.

Since mutexes mess around with spl levels, and the sched-lock isn't a
mutex, we need to make sure to fix the IPL when msleep does the locking.


ok art.


# 1.82 28-Nov-2007 oga

Add msleep. This is identical to tsleep but it takes a mutex as a
parameter. The mutex is unlocked just before sleep and relocked after
unless PNORELOCK is in flags, in which case it is left unlocked.

ok art@.


# 1.81 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.80 16-May-2007 art

The world of __HAVEs and __HAVE_NOTs is reducing. All architectures
have cpu_info now, so kill the option.

eyeballed by jsg@ and grange@


# 1.79 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.78 21-Mar-2007 art

Split tsleep into pieces. Instead of doing everything in a large "shove
everything into it" function, there are now 6 stages of tsleep with
an on-stack (remember that kernel stacks are not swappable now?)
structure that keeps track of the state.

This way we first setup the sleep, setup the events that might break the
sleep, finish the sleep (actually sleeping) and then take care of the
events that could wake us up.

In the future this will make it easier to implement functionality like:
setup sleep, release lock or check some condition, finish sleep, in a
race-free way and without duplicating or complicating the tsleep function
too much.

miod@, millert@ ok.


# 1.77 18-Mar-2007 art

Don't restart thrsleep after a signal. After a signal happened and we
weren't on the sleep queues, the condition we were sleeping on might
have changed, so we need to go back to userland and recheck that condition.

This fixes the majority of lockups and hanging threads in rthreads
since it fixes a race in the semaphore code.

ok tedu@


# 1.76 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_1_BASE
# 1.75 29-Nov-2006 miod

Kernel stack can be swapped. This means that stuff that's on the stack
should never be referenced outside the context of the process to which
this stack belongs unless we do the PHOLD/PRELE dance. Loads of code
doesn't follow the rules here. Instead of trying to track down all
offenders and fix this hairy situation, it makes much more sense
to not swap kernel stacks.

From art@, tested by many some time ago.


# 1.74 21-Oct-2006 tedu

tbert sent me a diff to change some 0 to NULL
i got carried away and deleted a whole bunch of useless casts
this is C, not C++. ok md5


Revision tags: OPENBSD_3_9_BASE OPENBSD_4_0_BASE
# 1.73 30-Dec-2005 tedu

change thrwakeup to take an argument which specifies how many threads
to wakeup.


# 1.72 22-Dec-2005 tedu

fix memory leak conditions in thrsleep and significantly simplify


# 1.71 14-Dec-2005 tedu

timeout code is not so happy with the negative values


# 1.70 14-Dec-2005 tedu

change wait message for thrsleep to "thrsleep"


# 1.69 13-Dec-2005 tedu

stupid me got the cast backwards


# 1.68 13-Dec-2005 tedu

thrsleep and thrwakeup, cast syscall arg from void * to long.


# 1.67 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.66 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.65 15-Nov-2005 pedro

Match comments with reality


Revision tags: OPENBSD_3_8_BASE
# 1.64 17-Jun-2005 niklas

A second approach at fixing the telnet localhost & problem
(but I tend to call it ssh localhost & now when telnetd is
history). This is more localized patch, but leaves us with
a recursive lock for protecting scheduling and signal state.
Better care is taken to actually be symmetric over mi_switch.
Also, the dolock cruft in psignal can go with this solution.
Better test runs by more people for longer time has been
carried out compared to the c2k5 patch.

Long term the current mess with interruptible sleep, the
default action on stop signals and wakeup interactions need
to be revisited. ok deraadt@, art@


# 1.63 29-May-2005 deraadt

sched work by niklas and art backed out; causes panics


# 1.62 25-May-2005 niklas

This patch is mostly art's work and was done *a year* ago. Art wants to thank
everyone for the prompt review and ok of this work ;-) Yeah, that includes me
too, or maybe especially me. I am sorry.

Change the sched_lock to a mutex. This fixes, among other things, the infamous
"telnet localhost &" problem. The real bug in that case was that the sched_lock
which is by design a non-recursive lock, was recursively acquired, and not
enough releases made us hold the lock in the idle loop, blocking scheduling
on the other processors. Some of the other processors would hold the biglock though,
which made it impossible for cpu 0 to enter the kernel... A nice deadlock.
Let me just say debugging this for days just to realize that it was all fixed
in an old diff no one ever ok'd was somewhat of an anti-climax.

This diff also changes splsched to be correct for all our architectures.


Revision tags: OPENBSD_3_6_BASE OPENBSD_3_7_BASE
# 1.61 29-Jul-2004 tedu

put the scheduler in its own file. reduces clutter, and logically separates
"put this process to sleep" and "find a process to run" operations.
no functional change. ok art@


# 1.60 25-Jul-2004 tedu

move db_show_all_procs to kern_proc.c, proc_printit goes in DDB too.
shuffle functions around so that scheduler is all together.
no real functional changes. ok art@ testing miod@


# 1.59 24-Jun-2004 tholo

This moves access to wall and uptime variables in MI code,
encapsulating all such access into well-defined functions
that make sure locking is done as needed.

It also cleans up some uses of wall time vs. uptime in some
places, but there is sure to be more of these needed as
well, particularly in MD code. Also, many current calls
to microtime() should probably be changed to getmicrotime(),
or to the {,get}microuptime() versions.

ok art@ deraadt@ aaron@ matthieu@ beck@ sturm@ millert@ others
"Oh, that is not your problem!" from miod@


# 1.58 21-Jun-2004 tholo

First step towards more sane time handling in the kernel -- this changes
things such that code that only needs a second-resolution uptime or wall
time, and used to get that from time.tv_secs or mono_time.tv_secs now gets
this from separate time_t globals time_second and time_uptime.

ok art@ niklas@ nordin@


# 1.57 20-Jun-2004 art

Merge error in smp merge. It's a miracle that people haven't noticed the
scheduling errors on non-i386 yet.

deraadt@ aaron@ ok


# 1.56 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.55 09-Jun-2004 art

Merge in a piece of the SMP branch into HEAD.

Introduce the cpu_info structure, p_cpu field in struct proc and global
scheduling context and various changed code to deal with this. At the
moment no architecture uses this stuff yet, but it will allow us a slow and
controlled migration to the new APIs.

All new code is ifdef:ed out.

ok deraadt@ niklas@


Revision tags: OPENBSD_3_5_BASE
# 1.54 26-Jan-2004 deraadt

having the monotonic thing as DEBUG is not going to get it fixed faster, it is just going to annoy people


# 1.53 23-Dec-2003 deraadt

enough is enough, driving people insane is not nice


# 1.52 23-Dec-2003 mickey

print tv_usec fields correctly in reporting nonmonotonic time


# 1.51 19-Dec-2003 millert

Add a check for time not flowing monotonically and just don't change
p->p_rtime in this case instead of zeroing it; based on an idea
from nordin@. Also add a printf about microtime() not being monotonic
for this case (from miod@) #ifdef DIAGNOSTIC. This version OK otto@


# 1.50 15-Dec-2003 millert

Fix some sign issues that fell out from the change of rlim_t to unsigned.
Also add a check for a negative result when subtracting microtime(&now)
from runtime and simply treat this as zero. This should *not* happen
but due to an apparent bug in microtime on dual clock machines, it does.
The microtime bug is currently being examined.
Based on a diff from miod@ with help from otto@; ok deraadt@ otto@


# 1.49 15-Dec-2003 deraadt

workaround a clock tick handling bug that the rlimit code just exposed.


Revision tags: OPENBSD_3_4_BASE
# 1.48 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.47 15-Mar-2003 deraadt

kill 10 minute non-root suffers stuff. noted that we still have this, by
matthieu, who noted it now that X is not running as root. ok nordin


Revision tags: UBC_SYNC_B
# 1.46 15-Oct-2002 art

Protect p_priority with splstatclock.


Revision tags: OPENBSD_3_2_BASE
# 1.45 24-Jul-2002 mickey

fix header printing in show_all_procs


# 1.44 03-Jul-2002 miod

Change all variables definitions (int foo) in sys/sys/*.h to variable
declarations (extern int foo), and compensate in the appropriate locations.


# 1.43 11-Jun-2002 art

splassert(IPL_STATCLOCK) mi_switch


Revision tags: OPENBSD_3_1_BASE
# 1.42 14-Mar-2002 millert

First round of __P removal in sys


# 1.41 08-Mar-2002 mickey

semicolon is not always what it seems, replace w/ a \n in asm labels


Revision tags: UBC_BASE
# 1.40 11-Nov-2001 art

branches: 1.40.2;
Let ltsleep take a const wmesg.


# 1.39 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.38 13-Sep-2001 art

Remove a comment that just doesn't make any sense.


# 1.37 07-Aug-2001 art

Change tsleep into ltsleep.
ltsleep takes an additional argument - a simplelock and unlocks it when it's
safe to do so.

tsleep now becomes a wrapper around ltsleep.

From NetBSD


# 1.36 27-Jun-2001 art

remove old vm


# 1.35 24-Jun-2001 mickey

cold is in systm.h now


# 1.34 26-May-2001 art

indentation.


Revision tags: OPENBSD_2_9_BASE
# 1.33 25-Mar-2001 csapuntz

Reintroduce wakeup call


# 1.32 15-Mar-2001 art

Print a '*' in front of curproc in ps in ddb.


# 1.31 27-Feb-2001 csapuntz

Add wakeup_n and wakeup_one. wakeup_n will wake up to n sleeping processes.
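
A toy sketch of the "wake at most n sleepers" idea follows. It is not the
kernel implementation: struct sleeper, toy_wakeup_n() and toy_wakeup_one() are
hypothetical names, the sleep queue is a plain singly linked list, and treating
wakeup_one as the n == 1 case is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sleeping process on a sleep queue. */
struct sleeper {
	const void	*chan;	/* wait channel the sleeper is blocked on */
	int		 awake;	/* set to 1 once the sleeper has been woken */
	struct sleeper	*next;
};

/*
 * Wake at most n sleepers waiting on chan, in queue order.
 * Woken entries are unlinked from the list. Returns the number woken.
 */
static int
toy_wakeup_n(struct sleeper **head, const void *chan, int n)
{
	struct sleeper **pp = head;
	int woken = 0;

	while (*pp != NULL && woken < n) {
		struct sleeper *s = *pp;

		if (s->chan == chan) {
			*pp = s->next;	/* unlink from the sleep queue */
			s->next = NULL;
			s->awake = 1;
			woken++;
		} else {
			pp = &s->next;	/* different channel, skip it */
		}
	}
	return woken;
}

/* Assumed relationship: waking one sleeper is the n == 1 case. */
static int
toy_wakeup_one(struct sleeper **head, const void *chan)
{
	return toy_wakeup_n(head, chan, 1);
}
```

Bounding the number of wakeups avoids waking every sleeper on a channel when
only one of them can make progress.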


# 1.30 19-Feb-2001 art

When doing an assertion for phz, just do it once when we set phz,
not once per process.


# 1.29 10-Nov-2000 art

Change the ktrace interface functions from taking the trace vnode to taking the
traced proc. The vnode is in the proc and all functions need the proc.


Revision tags: OPENBSD_2_8_BASE
# 1.28 03-Aug-2000 mickey

s/principal/principle/; from netbsd


# 1.27 06-Jul-2000 art

Typo in comment and some cleanup of roundrobin.


# 1.26 27-Jun-2000 art

Slight optimization of wakeup.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 19-Apr-2000 art

Remove the roundrobin_attempts hack and replace it with per-process scheduling
flags (much nicer for future smp work).
Add two generic functions yield() and preempt(). Use preempt() in uio when
we are told to yield.
Based on my idea, code written by Jason Thorpe from NetBSD.


# 1.23 23-Mar-2000 art

Don't reinitialize the tsleep and ITIMER_REAL timers all the time.
The function and the argument never change.


# 1.22 23-Mar-2000 art

use the new timeout interface for tsleep.


# 1.21 23-Mar-2000 art

Adapt roundrobin and schedcpu to the new timeout API.


# 1.20 03-Mar-2000 art

Keep track of the number of times we trigger a reschedule before the
context switch actually happens.


# 1.19 03-Mar-2000 art

Use the LIST_FIRST macro to get the head of zombproc list.


# 1.18 03-Mar-2000 art

Use LIST_ macros instead of internal field names to walk the allproc list.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.17 05-Sep-1999 tholo

branches: 1.17.4;
Use stathz to calculate CPU time when available; fixes CPU calculation
problems when stathz runs at different speed than hz/profhz.


# 1.16 15-Aug-1999 pjanzen

Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
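
The clipping argument in the "nice bug" paragraph can be checked with a small
userland sketch. The constants below are assumptions for illustration
(NICE_WEIGHT of 2 is taken from the text above; PRIO_MAX of 20 and PPQ of 4 are
the conventional BSD values and are assumed here), and old_clip(), new_clip()
and toy_penalty() are hypothetical names, not functions from kern_synch.c.

```c
#include <limits.h>

#define NICE_WEIGHT	2	/* "current nice weight of two" */
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */
#define ESTCPULIM	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Old behavior: an unsigned char can never exceed UCHAR_MAX, so no-op. */
static unsigned char
old_clip(unsigned char estcpu)
{
	unsigned int v = estcpu;

	return (unsigned char)(v > UCHAR_MAX ? UCHAR_MAX : v);
}

/* Fixed behavior: keep estcpu <= NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned char
new_clip(unsigned int estcpu)
{
	return (unsigned char)(estcpu > ESTCPULIM ? ESTCPULIM : estcpu);
}

/*
 * Toy priority penalty (larger is worse): unscaled estcpu plus the
 * weighted nice value. A simplification of the real calculation.
 */
static int
toy_penalty(unsigned char estcpu, int nice)
{
	return estcpu + NICE_WEIGHT * nice;
}
```

Under these assumptions, a compute-bound nice +0 process clipped by new_clip()
tops out at a penalty of 36, one full queue (PPQ) below the minimum penalty of
40 for a nice +20 process, whereas old_clip() lets it climb well past that.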


# 1.15 21-Apr-1999 alex

Improved ps formatting.


Revision tags: OPENBSD_2_5_BASE
# 1.14 26-Feb-1999 art

uvm allocation and name changes


# 1.13 15-Nov-1998 art

GC unnecessary declaration


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 03-Feb-1998 deraadt

bad types; wileyc@sekiya.twics.co.jp


# 1.11 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


# 1.10 04-Nov-1997 chuck

fix printf formatting of "ps" (aka "show all proc") so that lines never
overflow (always hated that).

replaced "/m" flag with:
/a == show process address info
/n == show normal process info [currently the default]
/w == show process wait/emul info


Revision tags: OPENBSD_2_2_BASE
# 1.9 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.8 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.7 28-Jul-1997 deraadt

two unneeded variables; enami@ba2.so-net.or.jp


Revision tags: OPENBSD_2_1_BASE
# 1.6 19-Jan-1997 briggs

asm -> __asm


# 1.5 23-Nov-1996 kstailey

remrq -> remrunqueue


Revision tags: OPENBSD_2_0_BASE
# 1.4 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.3 21-Apr-1996 deraadt

partial sync with netbsd 960418, more to come


# 1.2 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision