History log of /freebsd-11-stable/sys/kern/kern_thread.c
Revision Date Author Comments
# 337242 03-Aug-2018 asomers

MFC r336205:

Don't acquire evclass_lock with a spinlock held

When the "pc" audit class is enabled and auditd is running, witness will
panic during thread exit because au_event_class tries to lock an rwlock
while holding a spinlock acquired upstack by thread_exit.

To fix this, move AUDIT_SYSCALL_EXIT further upstack, before the spinlock is
acquired. Of thread_exit's 16 callers, it's only necessary to call
AUDIT_SYSCALL_EXIT from two: exit1 (for exiting processes) and kern_thr_exit
(for exiting threads). The other callers are all kernel threads, which
needn't call AUDIT_SYSCALL_EXIT because, since they can't make syscalls,
there will be nothing to audit. And exit1 already calls
AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that
case.

PR: 228444
Reported by: aniketp
Reviewed by: aniketp, kib
Differential Revision: https://reviews.freebsd.org/D16210


# 331727 29-Mar-2018 mjoras

MFC r325621, r325622, r331227

Add EVENTHANDLER_LIST and some users.

Also fix a longstanding bug in mtx initialization.
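
For illustration, a minimal sketch (hedged; see sys/eventhandler.h for the
authoritative macros) of the EVENTHANDLER_LIST KPI this merge brings in,
using the thread_dtor event that kern_thread.c defines:

    #include <sys/param.h>
    #include <sys/eventhandler.h>

    /*
     * A statically defined handler list lets callers skip the by-name
     * list lookup that plain EVENTHANDLER_INVOKE() performs on each call.
     */
    EVENTHANDLER_LIST_DEFINE(thread_dtor);

    static void
    example_thread_dtor_invoke(struct thread *td)
    {
            /* Invoke all registered thread_dtor handlers directly. */
            EVENTHANDLER_DIRECT_INVOKE(thread_dtor, td);
    }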


# 331017 15-Mar-2018 kevans

MFC r317055,r317056 (glebius): Include sys/vmmeter.h where needed

r317055: All these files need sys/vmmeter.h, but until now they got it
implicitly included via sys/pcpu.h.

r317056: Typo!


# 318743 23-May-2017 badger

move p_sigqueue to the end of struct proc

In order to preserve KBI in stable branches, replace the existing
p_sigqueue slot with padding and move the expanded (as of r315949)
p_sigqueue to the end of the struct.

This is a repeat of r317529 (which concerned td_sigqueue in struct
thread) for p_sigqueue in struct proc.

Virtualbox modules (and possibly others) are affected without this fix.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D10843
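
Schematically, the KBI-preserving move looks like the following (all names
and sizes here are purely illustrative, not the real struct proc layout):

    struct sigqueue_grown { int sq_flags; int sq_extra; }; /* grew in r315949 */

    struct proc_layout_sketch {
            char    p_pad_sigq[16]; /* padding where the old p_sigqueue sat */
            /* ... all other fields keep their original offsets ... */
            struct sigqueue_grown p_sigqueue; /* expanded struct at the end */
    };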


# 318183 11-May-2017 kib

MFC r317523:
Add asserts to verify stability of struct proc and struct thread layouts.


# 315836 23-Mar-2017 avg

MFC r315074: actually implement proc:::lwp-exit probe


# 315374 16-Mar-2017 mjg

MFC r313391:

Bump struct thread alignment to 32.

This gives additional bits to use in locking primitives which store
the lock thread pointer in the lock value.
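
A hypothetical illustration of the benefit: with 32-byte alignment the five
low bits of a thread pointer are always zero, so a lock word can pack flag
bits next to the owner pointer (names here are made up):

    #include <sys/types.h>

    struct thread;

    #define EX_LOCK_FLAGMASK 0x1fUL /* 5 bits free at 32-byte alignment */

    static __inline struct thread *
    ex_lock_owner(uintptr_t lock_word)
    {
            /* Strip the packed flag bits to recover the owner pointer. */
            return ((struct thread *)(lock_word & ~EX_LOCK_FLAGMASK));
    }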


# 315064 11-Mar-2017 kib

MFC r314253:
Do not leak mount references for dying threads.


# 304883 27-Aug-2016 kib

MFC r303426:
Rewrite subr_sleepqueue.c's use of callouts to not depend on the
specifics of the callout KPI.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 302328 03-Jul-2016 kib

Provide helper macros to detect the 'non-silent SBDRY' state and to
calculate the appropriate return value for stops. Simplify the code by
using them.

Fix a typo in sig_suspend_threads(). The thread whose sleep must be
aborted is td2. (*)

In issignal(), when handling a stopping signal for a thread in the
TD_SBDRY_INTR state, do not stop; doing so is wrong and fires an assert.
This is yet another place where execution should be forced out of the
SBDRY-protected region. For such cases, return -1 from issignal() and
translate it to the corresponding error code in sleepq_catch_signals().
Assert that other consumers of cursig() are not affected by the new
return value. (*)

Micro-optimize, mostly VFS and VOP methods, by avoiding the function
calls when a SIGDEFERSTOP_NOP non-change is requested. (**)

Reported and tested by: pho (*)
Requested by: bde (**)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 302215 26-Jun-2016 kib

Rewrite sigdeferstop(9) and sigallowstop(9) into a more flexible
framework that allows setting the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps with two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.

Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# 301960 16-Jun-2016 kib

Remove XXX comments from kern_thread.c. In one case, there is no
reason for it in modern times. In the other case, expand the comment
so that it states instead of doubts.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
X-Differential revision: https://reviews.freebsd.org/D6731


# 301959 16-Jun-2016 kib

Remove code duplication.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
X-Differential revision: https://reviews.freebsd.org/D6731


# 301456 05-Jun-2016 kib

Get rid of struct proc p_sched and struct thread td_sched pointers.

p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for thread0. Avoid the useless indirection; instead, calculate
the td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate the layout of the dynamic allocation.

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711
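
The co-allocation puts the scheduler data immediately after the thread, so
td_get_sched(9) reduces to pointer arithmetic along these lines:

    static __inline struct td_sched *
    td_get_sched(struct thread *td)
    {
            /* The td_sched is co-allocated directly after struct thread. */
            return ((struct td_sched *)&td[1]);
    }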


# 300043 17-May-2016 kib

Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held. The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
necessary checks for the inconsistent and abandoned conditions into
existing paths. Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in the kernel, but
is too expensive). Instead, for the duration of a lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

The kernel must be aware of the per-thread location of the heads of the
robust mutex lists and the current active mutex slot. When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs the kernel about the location of the list heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.

Reviewed by: jilles (previous version, supposedly the objection was fixed)
Discussed with: brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by: pho
Sponsored by: The FreeBSD Foundation
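
From userspace the feature is consumed through the standard POSIX
robust-mutex calls; a minimal usage sketch (error handling abbreviated):

    #include <errno.h>
    #include <pthread.h>

    static pthread_mutex_t m;

    static void
    example_robust_lock(void)
    {
            pthread_mutexattr_t attr;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
            pthread_mutex_init(&m, &attr);

            if (pthread_mutex_lock(&m) == EOWNERDEAD) {
                    /* Previous owner died holding the lock; repair state. */
                    pthread_mutex_consistent(&m);
            }
    }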


# 292892 29-Dec-2015 jhb

Call kern_thr_exit() instead of duplicating it.

This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.

Reviewed by: kib
MFC after: 1 week


# 289661 20-Oct-2015 kib

Mark struct thread zone as type-stable.

When acquisition of one of several lock types (including blockable
mutexes and sx) fails, the locking primitives spin while the owner
thread is running. The spinning loop tests the running condition by
dereferencing the owner->td_state field of the owner thread. If the
owner thread exited while the spinner was off the processor, it is
harmless to access the reused struct thread of the owner, since in the
near future the current processor will notice the owner change and
make appropriate progress. But it could be that the page which carried
the freed struct thread was unmapped, in which case we fault (this
cannot happen on amd64).

For now, disallowing freeing of the struct thread seems to be good
enough, and tests which create a lot of threads did not demonstrate
regressions.

Reviewed by: jhb, pho
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D3908
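
A sketch of the change, assuming UMA_ZONE_NOFREE is the flag that provides
the type stability (see uma(9); the alignment shown is the later, bumped
value):

    /* In threadinit(): pages backing the thread zone are never freed. */
    thread_zone = uma_zcreate("THREAD", sched_sizeof_thread(),
        thread_ctor, thread_dtor, thread_init, thread_fini,
        32 - 1, UMA_ZONE_NOFREE);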


# 285633 16-Jul-2015 mjg

Get rid of lim_update_thread and cred_update_thread.

Their primary use was in thread_cow_update to free up old resources.
Freeing had to be done with the proc lock held, and the _cow_ functions
already knew how to free the old structs.


# 285387 11-Jul-2015 adrian

Add an initial NUMA affinity/policy configuration for threads and processes.

This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policies.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.

Sponsored by: Norse Corp, Inc.; Dell


# 284215 10-Jun-2015 mjg

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at the process's credentials (as opposed to the
thread's) is provided with *_proc variants of the relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 284214 10-Jun-2015 mjg

Generalised support for copy-on-write structures shared by threads.

Thread credentials are maintained as follows: each thread has a pointer to
the creds and a reference on them. The pointer is compared with the proc's
creds on the userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.


# 283382 24-May-2015 dchagin

In preparation for switching the linuxulator to use the native 1:1
threads, add a hook for cleaning up thread resources before the thread
dies.

Differential Revision: https://reviews.freebsd.org/D1038


# 283320 23-May-2015 kib

If a thread requested not to stop at a non-boundary, then not only
stopping signals should obey, but also all forms of single-threading.
Otherwise, the thread might sleep interruptibly while owning some
resources, and the single-threading thread could try to access them.
An example is owning a vnode lock while dumping core.

Submitted by: Conrad Meyer
Review: https://reviews.freebsd.org/D2612
Tested by: pho
MFC after: 1 week


# 283291 22-May-2015 jkim

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years on head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
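
After the cleanup, initializations follow this pattern; callout_init(9)
takes a plain boolean mpsafe argument (ex_callout is a made-up example):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/callout.h>

    static struct callout ex_callout;

    static void
    example_callout_setup(void)
    {
            /* was: callout_init(&ex_callout, CALLOUT_MPSAFE) */
            callout_init(&ex_callout, 1);
    }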


# 282944 15-May-2015 kib

Right now, the process' p_boundary_count counter is decremented by the
suspended thread itself, on the return path from
thread_suspend_check(). A consequence is that a return from
thread_single_end(SINGLE_BOUNDARY) may leave p_boundary_count
non-zero; it might even be equal to the thread count.

Now, assume that we have two threads in the process, both calling
execve(2). Suppose that the first thread won the race to be the
suspension thread, and that afterward its exec failed for any reason.
After the first thread did thread_single_end(SINGLE_BOUNDARY), the
second thread becomes the process suspension thread and checks
p_boundary_count. The non-zero value of the count allows the
suspension loop to finish without actually suspending some threads.
In other words, we enter the exec code with some threads not suspended.

Fix this by decrementing p_boundary_count in the
thread_single_end()->thread_unsuspend_one() path while marking the thread
as runnable. This way, a return from thread_single_end() guarantees
that the counter is cleared. We do not care whether the unsuspended
thread has a chance to run.

Add some asserts to ensure the state of the process when the single
boundary suspension is lifted. Also make thread_unsuspend_one()
static.

In collaboration with: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 282679 09-May-2015 kib

Do not return from thread_single(SINGLE_BOUNDARY) until all stopped
threads are guaranteed to be removed from the processors.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 281696 18-Apr-2015 kib

Initialize td_sel in thread_init(). Struct thread is not zeroed on
the initial allocation, but seltdinit() assumes that td_sel is NULL
or a valid pointer. Note that thread_fini()/seltdfini() also relies
on this, but correctly resets td_sel to NULL.

Submitted by: luke.tw@gmail.com
PR: 199518
MFC after: 1 week
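
The fix amounts to clearing the field in the UMA init routine; roughly:

    static int
    thread_init(void *mem, int size, int flags)
    {
            struct thread *td = mem;

            /* Memory from this zone is not zeroed on allocation. */
            td->td_sel = NULL;
            /* ... remaining initialization ... */
            return (0);
    }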


# 279390 28-Feb-2015 kib

The umtx_lock mutex is used by the top half of the kernel, but is
currently a spin lock. Apparently, the only reason for this is that
umtx_thread_exit() is called under the process spinlock, which puts the
requirement on the umtx_lock. Note that the witness static order list
is wrong for the umtx_lock: umtx_lock is explicitly before any thread
lock, so it is also before the sleepq locks.

Change umtx_lock to be a sleepable mutex. For the reason above, the
calls to umtx_thread_exit() are moved from thread_exit() to earlier in
each caller, when the process spin lock is not yet taken.

Discussed with: jhb
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
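
The change boils down to initializing the lock as a regular sleep mutex;
schematically (the variable name here is illustrative):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx example_umtx_lock;

    static void
    example_umtx_lock_init(void)
    {
            /* was: mtx_init(..., MTX_SPIN) */
            mtx_init(&example_umtx_lock, "umtx lock", NULL, MTX_DEF);
    }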


# 277528 22-Jan-2015 hselasky

Revert r277213:

FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438


# 277213 15-Jan-2015 hselasky

Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There have been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".

Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste


# 275820 16-Dec-2014 kib

Add missed break.

CID: 1258586
Sponsored by: The FreeBSD Foundation
MFC after: 4 days


# 275745 13-Dec-2014 kib

Add a facility to stop all userspace processes. The supposed use of the
feature is to quiesce the system before suspend.

Stop is implemented by reusing thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) the caller to operate
on another process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide the debugging sysctl debug.stop_all_proc, which causes a total
stop and suspends the syncer, while waiting for the variable to be
reset for resume. It is used for debugging; it should be removed after
the real use of the interface is added.

In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 275617 08-Dec-2014 kib

Do some refactoring and minor cleanups of the thread_single() code in
preparation for the global stop commit.

Move the code that weeds suspended or sleeping threads into the
appropriate state into the helper weed_inhib(). The current code
already has deep nesting and is hard to follow [1].

Add currently useless helper remain_for_mode(), which returns the
count of threads which are allowed to run, according to the
single-threading mode.

In thread_single_end(), do not save curthread into a local variable;
it is unused afterwards, except to find curproc.

Remove stray empty line.

Requested by: avg [1]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 275616 08-Dec-2014 kib

A thread waiting for a vfork(2)-ed child to exec or exit must allow
for suspension.

Currently, the loop performs an uninterruptible cv_wait(9) call, which
prevents suspension until the child allows further execution of the
parent. If the child is stopped, suspension or single-threading is
delayed indefinitely.

Create a helper thread_suspend_check_needed() to identify the need for
a call to thread_suspend_check(). It is required since a call to
thread_suspend_check() cannot safely be made while owning the child
(p2) process lock. Only when suspension is needed, drop the p2 lock
and call thread_suspend_check(). Perform the cv wait with a timeout,
in case suspend is requested after the wait has started; I do not see
a better way to interrupt the wait.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 275121 26-Nov-2014 kib

The process spin lock currently has the following distinct uses:

- Threads' lifetime cycle, in particular, counting of the threads in
the process, and interlocking with the process mutex and thread lock.
The main reason for this is that turnstile locks are after thread
locks, so you e.g. cannot unlock a blockable mutex (think process
mutex) while owning a thread lock.

- Virtual and profiling itimers, since the timers' activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().

- Profiling code (profil(2)), for a similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().

- Resource usage accounting. The need for the spinlock there is subtle;
my understanding is that the spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().

The split is done mostly for code clarity, and should not affect
scalability.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
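
The new interlocks are wrapped in macros along these lines (a sketch; see
sys/proc.h for the authoritative definitions):

    #define PROC_ITIMLOCK(p)        mtx_lock_spin(&(p)->p_itimmtx)
    #define PROC_ITIMUNLOCK(p)      mtx_unlock_spin(&(p)->p_itimmtx)
    #define PROC_PROFLOCK(p)        mtx_lock_spin(&(p)->p_profmtx)
    #define PROC_PROFUNLOCK(p)      mtx_unlock_spin(&(p)->p_profmtx)
    #define PROC_STATLOCK(p)        mtx_lock_spin(&(p)->p_statmtx)
    #define PROC_STATUNLOCK(p)      mtx_unlock_spin(&(p)->p_statmtx)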


# 271008 03-Sep-2014 kib

Style.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 271007 03-Sep-2014 kib

Retire thread_unthread(); it has only one caller. Update the comment in
the block of code before the previous call to thread_unthread().

Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 271000 03-Sep-2014 kib

Right now, thread_single(SINGLE_EXIT) returns after the p_numthreads
reaches 1. The p_numthreads counter is decremented in thread_exit() by
a call to thread_unlink(). This means that the exiting threads may
still execute on other CPUs when thread_single(SINGLE_EXIT) returns.
As a result, the vmspace could be destroyed while paging structures are
still used on other CPUs by exiting threads.

Delay the return from thread_single(SINGLE_EXIT) until all threads are
really destroyed by thread_stash() after the last switch out. The
p_exitthreads counter already provides the required mechanism, move
the wait from the thread_wait() (which is called from wait(2) code)
into thread_single().

Reported by: many (as "panic: pmap active <addr>")
Reviewed by: alc, jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
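
The moved wait is essentially this loop, now run in
thread_single(SINGLE_EXIT) (a sketch; p is the exiting process and the
sleep happens under the process lock):

    /* Wait until the last exiting thread has been stashed. */
    while (p->p_exitthreads != 0)
            msleep(&p->p_exitthreads, &p->p_mtx, PWAIT, "thr_exit", 0);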


# 269095 25-Jul-2014 deischen

Insert new threads at the end of the thread list in the process
instead of at the beginning. This allows an intra process signal
to be sent to the oldest thread with the signal unmasked - which,
if it still exists, is the main thread. This mimics behavior
found in Linux and Solaris.


# 268351 06-Jul-2014 marcel

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# 258622 26-Nov-2013 avg

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks
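
Under the new convention, a definition like the following yields the
DTrace probe name proc:::lwp-exit:

    #include <sys/sdt.h>

    /* A double underscore in the C identifier becomes '-' in the name. */
    SDT_PROBE_DEFINE(proc, , , lwp__exit);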


# 258541 25-Nov-2013 attilio

- For a kernel compiled only with KDTRACE_HOOKS and without any lock
debugging option, unbreak the lock tracing release semantics by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support being inlined in the mutex operations,
for kernels compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and removing the
dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows a new bug: the DTrace-derived debug support in
sfxge is broken and was never really tested. As it was not including
opt_kdtrace.h correctly before, it was never enabled, so it was kept
broken for a while. Fix this by using a protection stub, leaving the
sfxge driver authors the responsibility for fixing it appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 248584 21-Mar-2013 jhb

Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred. This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.

Tested by: pho
Reviewed by: kib
MFC after: 1 week


# 246996 19-Feb-2013 jhb

Fix a few typos.


# 240951 26-Sep-2012 kib

Make the updates of the tid ring buffer's head and tail pointers
explicit by moving them into separate statements from the buffer
element accesses.

Requested by: jhb
MFC after: 3 days
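
Schematically (the size and names here are illustrative), the buffer
element access and the head pointer update are now separate statements:

    #define TID_BUFFER_SIZE 2048

    static lwpid_t  tid_buffer[TID_BUFFER_SIZE];
    static int      tid_head, tid_tail;

    static void
    example_tid_free(lwpid_t tid)
    {
            /* Store the element, then advance the head on its own. */
            tid_buffer[tid_head] = tid;
            tid_head = (tid_head + 1) % TID_BUFFER_SIZE;
    }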


# 240813 22-Sep-2012 kib

Do not skip two elements of the tid_buffer when reusing a buffer
slot. Skipping eventually results in exhaustion of the tid space,
causing new threads to get tid -1 as their identifier.

The bad effect of having the thread id equal to -1 is that
UMTX_OP_UMUTEX_WAIT returns EFAULT for a lock owned by such a thread,
because casuword cannot distinguish between the literal value -1 read
from the address and the -1 returned as an indication of faulted
access. The _thr_umutex_lock() helper from libthr does not check for
errors from _umtx_op_err(2), causing an infinite loop in
mutex_lock_sleep().

We observed JVM processes hanging and consuming an enormous amount of
system time on machines with approximately 100 days of uptime.

Reported by: Mykola Dzham <freebsd levsha org ua>
MFC after: 1 week


# 240475 13-Sep-2012 attilio

Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is no legitimate
case where curthread == NULL after the pcpu 0 area has been properly
initialized.

Reviewed by: bde, jhb
MFC after: 1 week


# 240204 07-Sep-2012 jhb

A few whitespace and comment fixes.


# 239328 16-Aug-2012 kib

Fix grammar.

Submitted by: jh
MFC after: 1 week


# 239301 15-Aug-2012 kib

Add a sysctl kern.pid_max, which limits the maximum pid the system is
allowed to allocate, and a corresponding tunable with the same
name. Note that existing processes with higher pids are left intact.

MFC after: 1 week
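
Userland can read (or, subject to limits, adjust) the value with
sysctl(3); for example:

    #include <sys/types.h>
    #include <sys/sysctl.h>

    static int
    example_read_pid_max(void)
    {
            int pid_max;
            size_t len = sizeof(pid_max);

            if (sysctlbyname("kern.pid_max", &pid_max, &len, NULL, 0) == -1)
                    return (-1);
            return (pid_max);
    }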


# 236317 30-May-2012 kib

Add a rangelock implementation, intended to be used for range-locking
the I/O regions of the vnode data space. The implementation is quite
simple-minded; it uses a list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 months
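
A hedged usage sketch of the KPI through the vnode wrappers (see
sys/rangelock.h; the returned cookie identifies the held range):

    #include <sys/param.h>
    #include <sys/vnode.h>

    static void
    example_ranged_read(struct vnode *vp, off_t off, off_t len)
    {
            void *cookie;

            cookie = vn_rangelock_rlock(vp, off, off + len);
            /* ... perform the I/O on [off, off + len) ... */
            vn_rangelock_unlock(vp, cookie);
    }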


# 235459 14-May-2012 rstone

Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
several probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added two probes that are not defined in Solaris, lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after: 2 weeks


# 229429 03-Jan-2012 jhb

Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicking mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.

Reviewed by: bde, kib
MFC after: 1 week


# 227657 18-Nov-2011 kib

Consistently use the process spin lock for protection of the
p->p_boundary_count. A race could cause an execve(2) from a threaded
process to hang, since the thread boundary counter was incorrect and
single-threading never finished.

Reported by: pluknet, pho
Tested by: pho
MFC after: 1 week


# 219968 24-Mar-2011 jhb

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


# 218976 23-Feb-2011 pluknet

Clean up the now unused #include statement.

Approved by: kib (mentor)
MFC after: 1 week
X-MFC with: r218972


# 218972 23-Feb-2011 kib

Move the max_threads_per_proc and max_threads_hits variables to the
file where they are used. Declare the kern.threads sysctl node at the
same location. Since no external use for the variables exists, make them
static.

Discussed with: dchagin
MFC after: 1 week


# 216314 09-Dec-2010 davidxu

MFp4:
The unit number allocator reuses IDs too quickly; this may hide bugs in
other code. Add a ring buffer to delay freeing a thread ID.


# 216313 09-Dec-2010 davidxu

MFp4:
It is possible for a lower priority thread to lend priority to a higher
priority thread; in the old code this was ignored, but the lending
should always be recorded. Add the field td_lend_user_pri to fix the
problem; if a thread does not have a borrowed priority, its value is
PRI_MAX.

MFC after: 1 week


# 213950 17-Oct-2010 davidxu

- Insert thread0 into the correct thread hash link list.
- In thr_exit() and kthread_exit(), only remove the thread from the
hash if it can directly exit, otherwise let exit1() do it.
- In thread_suspend_check(), fix the cleanup code for when a thread
needs to exit.
This change seems to fix the "Bad link elm" panic found by
Peter Holm.

Stress testing: pho


# 213714 11-Oct-2010 davidxu

Add a flag TDF_TIDHASH to prevent a thread from being
added to or removed from the thread hash table multiple times.


# 213642 09-Oct-2010 davidxu

Create a global thread hash table to speed up thread lookup; use an
rwlock to protect the table. In the old code, thread lookup was done
with the process lock held: to find a thread, the kernel had to iterate
through the process and thread lists, which is quite inefficient.
With this change, tests show that in extreme cases performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian
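
The lookup is exposed as tdfind(); a usage sketch (it returns the thread
with its process locked, or NULL):

    static void
    example_lookup(lwpid_t tid)
    {
            struct thread *td;

            td = tdfind(tid, -1);   /* -1: match a thread in any process */
            if (td != NULL) {
                    /* ... use td ... */
                    PROC_UNLOCK(td->td_proc);
            }
    }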


# 210138 15-Jul-2010 jhb

Retire td_syscalls now that it is no longer needed.


# 208488 24-May-2010 kib

Fix the double counting of the last process thread's td_incruntime
on exit, which is done once in thread_exit() and a second time in
proc_reap(), by clearing td_incruntime.

Use the opportunity to revert to the pre-RUSAGE_THREAD exporting of ruxagg()
instead of ruxagg_locked() and use it from thread_exit().

Diagnosed and tested by: neel
MFC after: 3 days


# 207606 04-May-2010 kib

Fix typo in comment.

MFC after: 3 days


# 207605 04-May-2010 kib

Remove a comment that merely repeats code.

Submitted by: bde
MFC after: 1 week


# 207602 04-May-2010 kib

Implement RUSAGE_THREAD. Add td_rux to keep extended runtime and ticks
information for the thread, to allow calcru1() (re)use.

Rename ruxagg()->ruxagg_locked(), ruxagg_tlock()->ruxagg() [1].
The ruxagg_locked() function no longer clears thread ticks nor
td_incruntime.

Requested by: attilio [1]
Discussed with: attilio, bde
Reviewed by: bde
Based on submission by: Alexander Krizhanovsky <ak natsys-lab com>
MFC after: 1 week
X-MFC-Note: td_rux shall be moved to the end of struct thread
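
From userland the new flag is used like any other getrusage(2) selector:

    #include <sys/resource.h>

    static long
    example_thread_user_us(void)
    {
            struct rusage ru;

            if (getrusage(RUSAGE_THREAD, &ru) != 0)
                    return (-1);
            /* User time consumed by the calling thread only. */
            return (ru.ru_utime.tv_sec * 1000000L + ru.ru_utime.tv_usec);
    }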


# 198464 25-Oct-2009 jkoshy

Inform hwpmc(4) of a thread's impending demise prior to invoking sched_throw().

Debugging help: fabient
Review and testing by: fabient


# 196730 01-Sep-2009 kib

Reintroduce r196640, after fixing the problem with my testing.

Remove the altkstacks; instead, instantiate threads with the kernel
stack allocated with the right size from the start. For a thread that
has its kernel stack cached, verify that the requested stack size is
equal to the actual one, and reallocate the stack if the sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce a separate kernel stack cache that
keeps some limited amount of preallocated kernel stacks to lower the
latency of thread allocation. Add a vm_lowmem handler to prune the
cache on the low memory condition. This way, systems with a reasonable
number of threads get lower latency of thread creation, while still not
exhausting a significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarious)
MFC after: 1 week


# 196648 29-Aug-2009 kib

Revert r196640 and r196644 for now.


# 196640 29-Aug-2009 kib

Remove the altkstacks; instead, instantiate threads with the kernel
stack allocated with the right size from the start. For a thread that
has its kernel stack cached, verify that the requested stack size is
equal to the actual one, and reallocate the stack if the sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce a separate kernel stack cache that
keeps some limited amount of preallocated kernel stacks to lower the
latency of thread allocation. Add a vm_lowmem handler to prune the
cache on the low memory condition. This way, systems with a reasonable
number of threads get lower latency of thread creation, while still not
exhausting a significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week


# 195702 14-Jul-2009 kib

Add a new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that the thread shall not be stopped upon receipt
of SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
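
Usage is simply OR-ing the new flag into the msleep(9) priority argument
(a sketch; sc is a hypothetical softc, and later trees spell the flag
PBDRY):

    /* Stop for SIGSTOP only at the kernel->usermode boundary. */
    error = msleep(&sc->sc_event, &sc->sc_mtx, PCATCH | PBDY, "exbdy", 0);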


# 195701 14-Jul-2009 kib

Move the repeated code that calculates the number of threads in the
process that still need to be suspended or exited from thread_single
into the new function calc_remaining().

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)


# 189845 15-Mar-2009 jeff

- Implement a new mechanism for resetting lock profiling. We now
guarantee that all cpus have acknowledged the cleared enable int by
scheduling the resetting thread on each cpu in succession. Since all
lock profiling happens within a critical section this guarantees that
all cpus have left lock profiling before we clear the datastructures.
- Assert that the per-thread queue of locks lock profiling is aware of
is clear on thread exit. There were several cases where this was not
true, which slows lock profiling and leaks information.
- Remove all objects from all lists before clearing any per-cpu
information in reset. Lock profiling objects can migrate between
per-cpu caches and previously these migrated objects could be zero'd
before they'd been removed

Discussed with: attilio
Sponsored by: Nokia


# 185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This brings a huge amount of changes; I'll enumerate only user-visible ones:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows additional disks to be used for cache.
Huge performance improvements, mostly for random reads of mostly
static content.

- slog

Allows additional disks to be used for the ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by them. Be very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Previously, if a write request failed, the system panicked. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Like the quota and reservation properties, but they don't count space
consumed by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 184667 05-Nov-2008 davidxu

Revert rev 184216 and 184199; due to the way the thread_lock works,
they may cause a lockup.

Noticed by: peter, jhb


# 184199 23-Oct-2008 davidxu

Actually, for signal and thread suspension, the extra process spin lock
is unnecessary; the normal process lock and thread lock are enough. The
spin lock is still needed for process and thread exiting, to mimic the
single sched_lock.


# 183929 16-Oct-2008 davidxu

Restore code wrongly removed in SVN revision 173004; its removal caused
threaded processes to get stuck in execv().

Noticed by: delphij


# 183911 15-Oct-2008 davidxu

Move the per-thread userland debugging flags into a separate field;
this eliminates some locking problems, e.g., a thread lock is needed
but cannot be used at that time. Only the process lock is needed now
for the new field.


# 182011 22-Aug-2008 jhb

A suspended thread can, in fact, be swapped out. Thus,
thread_unsuspend_one() needs to optionally wake up the swapper. Since
we hold the thread lock for that entire function, however, we have to
push that requirement up into the caller.

Found by: rwatson


# 181695 13-Aug-2008 attilio

Introduce some WITNESS improvements:
- Speed up the lock orderings lookup, modifying the witness graph from a
linked tree to a matrix. A table lookup caches the lock orderings in
order to make O(1) access for them. Any witness object has a unique
index within this lookup cache table.
- Reduce the lock contention on w_mtx by acquiring it only when a LOR
actually happens and not in the sane case. In order to do this, don't
totally flush the lock lists (per-CPU spinlocks list and per-thread
sleeplocks list), but check ll_count any time we need to verify
allocation sanity.
- Introduce the function witness_thread_exit() in the witness namespace,
which verifies that a thread doesn't hold any witness occurrence while
exiting.
- Rename the sysctl debug.witness.graphs into debug.witness.fullgraph and
add debug.witness.badstacks, which prints out stacks for the LORs
revealed. This is implemented using the stack(9) support, which makes
WITNESS dependent on the STACK option or on the DDB (including STACK)
option.
- Fix style(9) for src/sys/kern/subr_witness.c

The hash table approach was developed by Ilya Maykov on behalf of
Isilon Systems, which kindly released the patch.
Jeff Roberson ported the patch to -CURRENT and fixed the w_mtx
contention, on behalf of Nokia.

Submitted by: Ilya Maykov <ivmaykov at gmail dot com> (Isilon Systems), jeff
Sponsored by: Nokia


# 181334 05-Aug-2008 jhb

If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks
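
Consumers of the sleepq API thus follow a pattern along these lines (a
sketch; wchan is a hypothetical wait channel, and kick_proc0() is the
helper that wakes proc0):

    int wakeup_swapper;

    wakeup_swapper = sleepq_broadcast(wchan, SLEEPQ_SLEEP, 0, 0);
    sleepq_release(wchan);
    if (wakeup_swapper)
            kick_proc0();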


# 178272 17-Apr-2008 jeff

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset, walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


# 177471 21-Mar-2008 jeff

- Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
- Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
thread_suspend_check() to ast() rather than userret().
- Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
that we don't miss a suspend request. If this is set use the
expensive signal path.
- Set TDF_NEEDSUSPCHK when creating a new thread in thr in case the
creating thread is due to be suspended as well but has not yet been.

Reviewed by: davidxu (Authored original patch)


# 177427 20-Mar-2008 jeff

- There is no sense in calling sched_newthread() at thread_init() and
thread_fini(). The schedulers initialize themselves properly during
sched_fork_thread() anyhow. fini is only called when we're returning
the memory to the allocator which surely doesn't care what state the
memory is in.


# 177369 19-Mar-2008 jeff

- Restore the NULL check for td_cpuset. This can happen if a partially
constructed thread was torn down as is the case when we fail to allocate
a kernel stack.


# 177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 177085 12-Mar-2008 jeff

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# 177083 12-Mar-2008 jeff

- KSE may free a thread that was never actually forked. This will leave
td_cpuset NULL. Check for this condition before dereferencing the
cpuset.

Reported by: david@catwhisker.org, miwi@freebsd.org
Sponsored by: Nokia


# 176730 02-Mar-2008 jeff

Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuset_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All threads initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.

Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine
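
For example, pinning the calling thread to CPU 0 with the new syscalls:

    #include <sys/param.h>
    #include <sys/cpuset.h>

    static int
    example_pin_to_cpu0(void)
    {
            cpuset_t mask;

            CPU_ZERO(&mask);
            CPU_SET(0, &mask);
            /* id -1 selects the calling thread for CPU_WHICH_TID. */
            return (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask));
    }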


# 174848 22-Dec-2007 julian

give thread0 the tid 100000 and bump the others to start at 100001

MFC after: 1 week


# 174647 16-Dec-2007 jeff

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selectors on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescanned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# 174629 15-Dec-2007 jeff

- Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a separate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.

In collaboration with: Kip Macy
Sponsored by: Nokia


# 173631 15-Nov-2007 rrs

- Adds event handlers for process_ctor, process_dtor, process_init,
process_fini, thread_ctor, thread_dtor, thread_init, thread_fini. This
will allow us to dynamically extend areas in proc/thread for dtrace ;-)
Reviewed by: rwatson


# 173625 15-Nov-2007 julian

This time REALLY copy the name from the proc to the thread as a default.


# 173615 14-Nov-2007 marcel

o Rename cpu_thread_setup() to cpu_thread_alloc() to better
communicate that it relates to (is called by) thread_alloc()
o Add cpu_thread_free() which is called from thread_free()
to counteract cpu_thread_alloc().

i386: Have cpu_thread_free() call cpu_thread_clean() to
preserve behaviour.
ia64: Have cpu_thread_free() call mtx_destroy() for the
mutex initialized in cpu_thread_alloc().

PR: ia64/118024


# 173601 14-Nov-2007 julian

A bunch more files that should probably print out a thread name
instead of a process name.


# 173599 14-Nov-2007 julian

Make sure there is a good default thread name for all threads.


# 173361 05-Nov-2007 kib

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicking or dereferencing NULL.

As a consequence, vmspace_exec() and vmspace_unshare() now return an errno
int. A struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in thread_alloc(),
which itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# 173004 26-Oct-2007 julian

Introduce a way to make pure kernel threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls remain, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

Fix top to show kernel threads by their thread name in -SH mode.
Add a tdnam formatting option to ps to show thread names.

Make all idle threads actual kthreads and put them into their own 'idle'
process. Make all interrupt threads kthreads and put them in an 'intr'
process (mainly for aesthetic and accounting reasons).
Rename proc 0 to 'kernel'; its swapper thread is now 'swapper'.

man page fixes to follow.
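
For example, a minimal sketch of the new KPI (all foo names are
hypothetical; passing a NULL struct proc pointer attaches the new thread to
proc0, the 'kernel' process):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/kthread.h>
#include <sys/proc.h>

static struct thread *foo_td;

static void
foo_worker(void *arg)
{
    for (;;) {
        /* ... do periodic work on arg ... */
        pause("foowrk", hz);
    }
}

static int
foo_start(void *sc)
{
    /* Add a named kernel thread to proc0. */
    return (kthread_add(foo_worker, sc, NULL, &foo_td,
        0, 0, "foo worker"));
}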


# 172263 21-Sep-2007 jeff

- Call sched_sleep() before we suspend threads. sched_wakeup() is already
called via setrunnable(). This allows time slept while suspended to
be accounted for in swapping decisions.

Approved by: re


# 172256 20-Sep-2007 attilio

Fix some entries in the locks static table of witness.
In particular:
- smp_tlb_mtx is no longer used, so it is axed.
- smp rendezvous lock isn't really a leaf spin-mutex. Its bad placement in
the table, however, has been the source of a false positive LOR report
involving the dt_lock. In the older table it would have had sched_lock
after it, so it was not a leaf lock even then.
- allpmaps is only used in ia32 architecture, so it is inserted in the
appropriate stub.

Additionally:
- kse_zombie_lock is no longer present, so its definition is axed out.
- zombie_lock doesn't need an exported symbol, so just let it be
declared static.

Tested by: kris
Approved by: jeff (mentor)
Approved by: re


# 172207 17-Sep-2007 jeff

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in-tree tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 171611 27-Jul-2007 attilio

Actually, upcalls cannot be freed while destroying the thread because we
would have to call uma_zfree() with various spinlocks held. Rearranging the
code would not help here because we cannot break atomicity with respect to
the process spinlock, so the only choice we have is to defer the operation.
In order to do this, use a global queue synchronized through the kse_lock
spinlock, which is drained at any thread_alloc() / thread_wait() through a
call to thread_reap().
Note that this approach is not ideal, as we would want a per-process
list of zombie upcalls, but it follows the initial guidelines of the KSE
authors.

Tested by: jkim, pav
Approved by: jeff, julian
Approved by: re


# 171556 23-Jul-2007 attilio

Actually, the KSE kernel bits' locking is broken and can likely lead to
dangerous races.
Fix these problems by adding correct locking for the members of 'struct
kse_upcall' and other struct proc/struct thread related members.
For the moment, just leave ku_mflag and ku_flags "lazily" locked.
While here, clean up the code, removing the function kse_GC() (unused)
and merging upcall_link(), upcall_unlink(), upcall_stash() into their
respective callers (static functions, very short and only called in one
place).

Reported by: pav
Tested by: pav (on some pointyhat cluster nodes)
Approved by: jeff
Approved by: re
Sponsored by: NGX Italy (http://www.ngx.it)


# 170630 12-Jun-2007 jeff

- Garbage collect unused concurrency functions.
- Remove unused kse fields from struct proc.
- Group remaining fields and #ifdef KSE them.
- Move some kern_kse.c only prototypes out of proc and into kern_kse.

Discussed with: Julian


# 170598 12-Jun-2007 jeff

Solve a complex exit race introduced with thread_lock:
- Add a count of exiting threads, p_exitthreads, to struct proc.
- Increment p_exitthreads when we set the deadthread in thread_exit().
- When we thread_stash() a deadthread use an atomic to drop the count.
- Spin until the p_exitthreads count reaches 0 in thread_wait().
- Lock the last exiting thread momentarily to be certain that it has
exited cpu_throw().
- Restructure thread_wait(). It does not need a loop as there will only
ever be one thread.

Tested by: moose@opera.com
Reported by: kris, moose@opera.com


# 170466 09-Jun-2007 attilio

The current rusage code shows peculiar problems:
- Unsafe use of ruadd() in thread_exit()
- Non-atomicity of thread_exit() in the exit1() operations

This patch addresses these problems by allocating p_ru as part of the
process and modifying the way it is accessed.

A small chunk of this patch resolves a race on p_state in kern_wait(),
since we have to be sure about the zombifying process.

Submitted by: jeff
Approved by: jeff (mentor)


# 170296 04-Jun-2007 jeff

Commit 4/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
synchronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
- Move some common code into thread_suspend_switch() to handle the
mechanics of suspending a thread. The locking here is incredibly
convoluted and should be simplified.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170292 04-Jun-2007 attilio

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 170174 31-May-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. Reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 167944 27-Mar-2007 jhb

Align 'struct thread' on 16 byte boundaries so that the lower 4 bits are
always 0. Previously we aligned threads on a minimum of 8-byte boundaries.

Note: This changes the uma zone to no longer cache align threads. We
really want the uma zone to align threads to MAX(16, cache line size)
but there currently isn't a good way to express that to uma.
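
To illustrate why the alignment matters, a minimal sketch of the
pointer-tagging trick a lock implementation can then rely on (illustrative
only; not the actual mutex code):

#include <sys/types.h>

struct thread;

#define LOCK_FLAGSMASK  0xf    /* low 4 bits free with 16-byte alignment */

static __inline uintptr_t
lock_encode(struct thread *owner, uintptr_t flags)
{
    return ((uintptr_t)owner | (flags & LOCK_FLAGSMASK));
}

static __inline struct thread *
lock_owner(uintptr_t lockword)
{
    return ((struct thread *)(lockword & ~(uintptr_t)LOCK_FLAGSMASK));
}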

Submitted by: attilio


# 167352 09-Mar-2007 mohans

Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for close-to-open consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.


# 165693 31-Dec-2006 rwatson

Prefer a more traditional spelling of inhibited in comments and panic
messages.


# 165348 19-Dec-2006 davidxu

Remove unused sysctls.


# 164936 06-Dec-2006 julian

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 163714 27-Oct-2006 davidxu

Remove member p_procscopegrp which is no longer used by libthr.


# 163709 26-Oct-2006 jb

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 161678 28-Aug-2006 davidxu

This is initial version of POSIX priority mutex support, a new userland
mutex structure is added as following:
struct umutex {
__lwpid_t m_owner;
uint32_t m_flags;
uint32_t m_ceilings[2];
uint32_t m_spare[4];
};
The m_owner field represents the owner thread; it is a thread id. In the
non-contested case, userland can simply use atomic_cmpset_int to lock the
mutex. If the mutex is contested, the high-order bit will be set, and
userland should do locking and unlocking via kernel syscall. Flag
UMUTEX_PRIO_INHERIT represents pthread's PTHREAD_PRIO_INHERIT mutex; when
contention happens, the kernel should do priority propagation. Flag
UMUTEX_PRIO_PROTECT indicates pthread's PTHREAD_PRIO_PROTECT mutex;
userland should initialize m_owner to the contested state UMUTEX_CONTESTED,
so that atomic_cmpset_int will fail and the kernel syscall will be invoked
to do the locking. This is because, for such a mutex, the kernel should
always boost the thread's priority before it can lock the mutex. m_ceilings
is used by PTHREAD_PRIO_PROTECT mutexes: the first element is used to boost
the thread's priority when it has locked the mutex, and the second is used
when the mutex is unlocked. The PTHREAD_PRIO_PROTECT mutex's linked list is
kept in userland, and m_ceilings[1] is managed by the thread library, so
the kernel needn't allocate memory to keep the list; when such a mutex is
unlocked, the kernel resets m_owner to UMUTEX_CONTESTED.
Flag USYNC_PROCESS_SHARED indicates whether the synchronization object is
process-shared; if the flag is not set, it saves a vm_map_lookup() call.

The umtx chain is still used as a sleep queue. When a thread is blocked on
a PTHREAD_PRIO_INHERIT mutex, a umtx_pi is allocated to support priority
propagation; it is dynamically allocated and reference counted. It is not
optimized but works well in my tests. While the umtx chain has its own
locking protocol, the priority propagation protocol is entirely protected
by sched_lock, because the priority propagation function is called with
sched_lock held from the scheduler.

No visible performance degradation is found with these changes. Some
parameter names in the _umtx_op syscall are renamed.
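
A hedged userland sketch of the fast path described above, using today's
sys/umtx.h spellings (UMUTEX_UNOWNED, UMTX_OP_MUTEX_LOCK); this is not the
actual libthr code:

#include <sys/types.h>
#include <sys/umtx.h>
#include <machine/atomic.h>

static int
umutex_lock_sketch(struct umutex *m, uint32_t tid)
{
    /* Uncontested case: a single CAS from unowned to our thread id. */
    if (atomic_cmpset_acq_32((volatile uint32_t *)&m->m_owner,
        UMUTEX_UNOWNED, tid))
        return (0);
    /* Contested (high-order bit involved): let the kernel sort it out. */
    return (_umtx_op(m, UMTX_OP_MUTEX_LOCK, 0, NULL, NULL));
}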


# 160048 30-Jun-2006 maxim

o Fix typo in the comment.

PR: kern/99632
Submitted by: clsung


# 156942 21-Mar-2006 davidxu

Rethink it a bit, if there is a STOP flag, don't bother to resume other
threads.


# 156934 21-Mar-2006 davidxu

Because job control has higher priority than single threading in
thread_suspend_check(), call thread_stopped() to report SIGCHLD
if there is job control in progress.


# 156705 14-Mar-2006 davidxu

1. Count the last time slice; this is intended to fix the
"calcru: runtime went backwards" bug for threaded processes.
2. Add comment about possible logical problem with scheduler.

MFC after: 3 days


# 156680 13-Mar-2006 davidxu

Remove unused code.


# 155741 15-Feb-2006 davidxu

Fix a long-standing race between the sleep queue and the thread
suspension code. When a thread A is going to sleep, it calls
sleepq_catch_signals() to detect any pending signals or thread
suspension requests. If nothing happens, it returns without
holding the process lock or scheduler lock. This opens a race
window which allows a thread B to come in and do the process
suspension work; however, since A is still in the running state,
thread B can do nothing to A. Thread A continues and puts
itself into the actually-sleeping state, but B has never seen it,
and it sits there forever until B is woken up by other threads
sometime later (this can be a very long delay or never
happen). Fix this bug by forcing sleepq_catch_signals() to
return with the scheduler lock held.
Fix sleepq_abort() by passing it an interrupted code; previously
it worked like wakeup_one(), and the interruption could not be
identified correctly by the sleep queue code when the sleeping
thread was resumed.
Let thread_suspend_check() return EINTR or ERESTART, so the sleep
queue no longer has to use SIGSTOP as a hack to build a return
value.

Reviewed by: jhb
MFC after: 1 week


# 155594 13-Feb-2006 davidxu

In order to speed up process suspension on MP machines, send an IPI to
the remote CPU. While here, abstract the thread suspension code into a
function called sig_suspend_threads; the function is called when a process
receives a STOP signal.


# 155376 05-Feb-2006 rwatson

When exiting a thread, submit any pending record. Today, we don't
audit thread exit, but should that happen, this will prevent
unhappiness, as the thread exit system call will never return, and
hence not commit the record.

Pointed out by/with: cognet
Obtained from: TrustedBSD Project


# 155353 05-Feb-2006 rwatson

When GC'ing a thread, assert that it has no active audit record.
This should not happen, but with this assert, brueffer and I would
not have spent 45 minutes trying to figure out why he wasn't
seeing audit records with the audit version in CVS.

Obtained from: TrustedBSD Project


# 155195 01-Feb-2006 rwatson

Add new fields to process-related data structures:

- td_ar to struct thread, which holds the in-progress audit record during
a system call.

- p_au to struct proc, which holds per-process audit state, such as the
audit identifier, audit terminal, and process audit masks.

In the earlier implementation, td_ar was added to the zero'd section of
struct thread. In order to facilitate merging to RELENG_6, it has been
moved to the end of the data structure, requiring explicit
initialization in the thread constructor.

Much help from: wsalamon
Obtained from: TrustedBSD Project


# 153253 09-Dec-2005 davidxu

Now SIGCHLD is always queued.


# 152948 30-Nov-2005 davidxu

Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.


# 152185 08-Nov-2005 davidxu

Add support for queueing SIGCHLD same as other UNIX systems did.

For each child process whose status has been changed, a SIGCHLD instance
is queued. If the signal is still pending, and the process changed status
several times, the signal information is updated to reflect the latest
process status. If wait() returns because the status of a child process is
available, the pending SIGCHLD signal associated with the child process is
discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time the proc structure
is allocated, so even if the process signal queue is fully filled or there
is a memory shortage, the signal can still be sent to the process.

There is a boot-time tunable kern.sigqueue.queue_sigchild which
can control the behavior; setting it to zero disables the SIGCHLD queueing
feature. The tunable will be removed once the feature is proved
stable enough.

Tested on: i386 (SMP and UP)


# 151990 02-Nov-2005 davidxu

Add a thread_find() function to search for a thread by lwpid.
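
A minimal usage sketch, assuming (per the usual kernel convention) that the
caller holds the proc lock across the lookup and any use of the result:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

static void
act_on_lwp(struct proc *p, lwpid_t lwpid)
{
    struct thread *td;

    PROC_LOCK(p);
    td = thread_find(p, lwpid);    /* NULL if p has no such thread */
    if (td != NULL) {
        /* ... use td while the proc lock pins it ... */
    }
    PROC_UNLOCK(p);
}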


# 151585 23-Oct-2005 davidxu

Make p_itimers a pointer, so sys/proc.h does not need to include
sys/timers.h.


# 151576 23-Oct-2005 davidxu

Implement POSIX timers. Currently only the CLOCK_REALTIME and
CLOCK_MONOTONIC clocks are supported. I plan to merge the XSI timer
ITIMER_REAL and the other two CPU timers into the new code; currently
three slots are available for the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks two member fields:
sigev_notify_function
sigev_notify_attributes
I have found that sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.
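
For reference, standard POSIX usage of the new code with the supported
CLOCK_MONOTONIC clock and SIGEV_SIGNAL notification (a minimal sketch,
error handling elided):

#include <signal.h>
#include <time.h>

static timer_t
start_one_shot_second(void)
{
    struct sigevent sev = {
        .sigev_notify = SIGEV_SIGNAL,
        .sigev_signo = SIGALRM,
    };
    struct itimerspec its = {
        .it_value = { .tv_sec = 1, .tv_nsec = 0 },
    };
    timer_t tid;

    timer_create(CLOCK_MONOTONIC, &sev, &tid);
    timer_settime(tid, 0, &its, NULL);    /* one-shot, fires in 1 second */
    return (tid);
}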


# 151316 14-Oct-2005 davidxu

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, the pending signal set is kept as
before; an extra siginfo list holds additional siginfo_t data for signals.
Kernel code using psignal() still behaves as before: it won't fail
even under memory pressure. The only exception is when deleting a signal:
we should call sigqueue_delete to remove the signal from the sigqueue, not
SIGDELSET. Currently no kernel code delivers a signal
with additional data, so the kernel should be as stable as before.
A ksiginfo can carry more information; for example, a signal may
be delivered but its siginfo data thrown away if memory is short.
SIGKILL and SIGSTOP have a fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to a target
process; if resources are unavailable, EAGAIN will be returned as
the specification says (see the sketch below).
Just before a thread exits, signal queue memory is freed by
sigqueue_flush.
Currently, all signals are allowed to be queued, not only realtime signals.
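
The userland side is the standard POSIX interface; a minimal sketch:

#include <sys/types.h>
#include <signal.h>

/*
 * Queue SIGUSR1 with a data payload to a process; returns -1 with
 * errno == EAGAIN when the target's queueing resources are exhausted.
 */
static int
send_value(pid_t pid, int value)
{
    union sigval sv;

    sv.sival_int = value;
    return (sigqueue(pid, SIGUSR1, sv));
}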

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 150743 30-Sep-2005 davidxu

Fix a LOR between sleep and sched_lock by using a timeout wait
when a process reaches its maximum number of threads.

MFC after: 3 days


# 146670 27-May-2005 davidxu

Remove the sleep queue hack; it is no longer needed with the current sleep
queue. Actually, it causes a process to hang when it is being debugged.

PR: gnu/77818


# 145433 23-Apr-2005 davidxu

Change cpu_set_kse_upcall to a more generic style, so we can reuse it
in other code. Add cpu_set_user_tls and use it to tweak user registers
and set up user TLS. I wanted to merge it into cpu_set_kse_upcall,
but since cpu_set_kse_upcall is also used by M:N threads which may
not need this feature, I wrote a separate cpu_set_user_tls.


# 143944 21-Mar-2005 julian

Fix code freeing wrong cred pointer.

Submitted by: das
Noticed by: Coverity tool
MFC after: 3 days

Note: usually the two pointers point to the same
thing but it was still a bug.


# 143840 19-Mar-2005 phk

Sleeping is not allowed in uma->fini


# 143802 18-Mar-2005 phk

Use subr_unit to allocate thread IDs with.
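
A hedged sketch of the subr_unit KPI this switches to; the TID range, lock,
and function names below are illustrative rather than the actual
kern_thread.c values:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/limits.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

static struct unrhdr *tid_unrhdr;
static struct mtx tid_lock;

static void
tid_init(void)
{
    mtx_init(&tid_lock, "TID lock", NULL, MTX_DEF);
    /* Hand out unit numbers above the PID range. */
    tid_unrhdr = new_unrhdr(PID_MAX + 1, INT_MAX, &tid_lock);
}

static lwpid_t
tid_alloc(void)
{
    return (alloc_unr(tid_unrhdr));    /* -1 when the range is exhausted */
}

static void
tid_free(lwpid_t tid)
{
    free_unr(tid_unrhdr, tid);
}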

Tested by: davidxu


# 143149 05-Mar-2005 davidxu

Allocate umtx_q from the heap instead of the stack; this avoids
a page fault panic in the kernel under heavy swapping.


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 138843 14-Dec-2004 jeff

- Garbage collect several unused members of struct kse and struct ksegrp.
As best as I can tell, some of these were never used.


# 137946 20-Nov-2004 das

Remove local definitions of RANGEOF() and use __rangeof() instead.
Also remove a few bogus casts.


# 137281 05-Nov-2004 davidxu

Respect TDF_SINTR, don't suspend uninterruptible thread.


# 137279 05-Nov-2004 davidxu

Back out the previous commit; the P_STOPPED_BOUNDARY flag was already
cleared at the beginning of thread_single() when needed.


# 137231 04-Nov-2004 davidxu

Don't forget to turn off P_SINGLE_BOUNDARY for thread_single(SINGLE_EXIT),
otherwise a threaded process which calls execv() will hang in the kernel
and may not be killable!


# 136448 12-Oct-2004 jhb

Whitespace fix.


# 136177 05-Oct-2004 davidxu

In the original kern_execve() code, at the start of the function, it forced
all other threads to commit suicide. The problem is that execve() could
fail, and a failed execve() would change a threaded process to unthreaded;
this side effect is unexpected.
The new code introduces a new single-threading mode, SINGLE_BOUNDARY. In
this mode, all threads should suspend themselves at the user boundary
except the singler. We cannot use SINGLE_NO_EXIT because we want to start
from a clean state if execve() is successful; suspending other threads at
an unknown point, later resuming them from there, and forcing them to exit
at the user boundary may cause the process to start from a dirty state. If
execve() is successful, the current thread upgrades to SINGLE_EXIT mode and
forces the other threads to commit suicide at the user boundary; otherwise,
the other threads are resumed and their interrupted syscalls are restarted.

Reviewed by: julian


# 136171 05-Oct-2004 julian

Slight cleanup in the single threading code.

MFC after: 4 days


# 136160 05-Oct-2004 julian

Break out to a separate function, the code to revert a multithreaded
process back to officially being a non-threaded program.

MFC after: 4 days


# 136100 03-Oct-2004 julian

Always start out with an initialised ksegrp structure.

MFC after: 3 days


# 135780 24-Sep-2004 julian

Use the universal 'threaded process' flag rather than the
specific tests for different threading systems.

MFC after: 1 week


# 135573 22-Sep-2004 jhb

Various small style fixes.


# 135269 15-Sep-2004 julian

Try harder to get back to being a non-threaded process.

Submitted by: DavidXu
MFC after: 3 days


# 134791 05-Sep-2004 julian

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time, so that a ksegrp with
10000 threads cannot swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually
develop its own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
libkse for scope-system threads. This has slightly hurt libthr's
performance but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 134498 29-Aug-2004 davidxu

Only test return_instead if P_SINGLE_EXIT is set; otherwise a fork()
syscall can interrupt another thread's syscall in sleepq_catch_signals().
Currently, all callers know thread_suspend_check may suspend the thread
itself, so we needn't check return_instead for normal suspension
flags (no P_SINGLE_EXIT set).

Tested by: deischen
Reported by: Maarten L. Hekkelman <m.hekkelman@cmbi.kun.nl>


# 134013 19-Aug-2004 jhb

Now that the return value semantics of cv's for multithreaded processes
have been unified with that of msleep(9), further refine the sleepq
interface and consolidate some duplicated code:
- Move the pre-sleep checks for threaded processes into a
thread_sleep_check() function in kern_thread.c.
- Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c.
Specifically, if a thread is awakened by something other than a signal
while checking for signals before going to sleep, clear TDF_SINTR in
sleepq_catch_signals(). This removes a sched_lock lock/unlock combo in
that edge case during an interruptible sleep. Also, fix
sleepq_check_signals() to properly handle the condition if TDF_SINTR is
clear rather than requiring the callers of the sleepq API to notice
this edge case and call a non-_sig variant of sleepq_wait().
- Clarify the flags arguments to sleepq_add(), sleepq_signal() and
sleepq_broadcast() by creating an explicit submask for sleepq types.
Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of
0. Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and
move the setting of TDF_SINTR to sleepq_add() if this flag is set rather
than sleepq_catch_signals(). Note that it is the caller's responsibility
to ensure that sleepq_catch_signals() is called if and only if this flag
is passed to the preceding sleepq_add(). Note that this also removes a
sched_lock lock/unlock pair from sleepq_catch_signals(). It also ensures
that for an interruptible sleep, TDF_SINTR is always set when
TD_ON_SLEEPQ() is true.


# 133713 14-Aug-2004 julian

Whitespace nit.


# 133396 09-Aug-2004 julian

Increase the amount of data exported by KTR in the KTR_RUNQ setting.
This extra data is needed to really follow what is going on in the
threaded case.


# 133234 06-Aug-2004 rwatson

In thread_exit(), include more information about the thread/process
context in the KTR trace record. In particular, include the same
information as passed for mi_switch() and fork_exit() KTR trace
records.


# 132987 01-Aug-2004 green

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters
(see the sketch after this list).
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.
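
A hedged sketch of a now-failable constructor under the changed interface
(foo and M_FOO are hypothetical; the flags argument carries the
M_WAITOK/M_NOWAIT disposition through to suballocations):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static MALLOC_DEFINE(M_FOO, "foo", "foo constructor suballocations");

struct foo {
    void    *f_buf;
};

static int
foo_ctor(void *mem, int size, void *arg, int flags)
{
    struct foo *f = mem;

    /* Propagate the zone's sleep/no-sleep disposition. */
    f->f_buf = malloc(128, M_FOO, flags);
    if (f->f_buf == NULL)
        return (ENOMEM);    /* fail cleanly instead of panicking */
    return (0);
}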

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 132372 18-Jul-2004 julian

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theoretically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happened.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 132265 16-Jul-2004 jhb

Whitespace fix.


# 132087 13-Jul-2004 davidxu

Add code to support debugging threaded process.

1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread the current user thread is running on. Add tm_dflags into
kse_thr_mailbox; the flags are written by the debugger and tell the
UTS and kernel what should be done when the process is being
debugged. Currently there are two flags, TMDF_SSTEP and TMDF_DONOTRUNUSER.

TMDF_SSTEP is used to tell kernel to turn on single stepping,
or turn off if it is not set.

TMDF_DONOTRUNUSER is used to tell the kernel to schedule an upcall
whenever possible; to the UTS, it means do not run the user thread
until the debugger clears it. This behaviour is necessary because
gdb wants to resume only one thread when the thread's pc is
at a breakpoint, and that thread needs to go forward; in order to
prevent other threads sneaking past the breakpoints, it removes
the breakpoint and lets only the one thread go. Also, add km_lwp to
kse_mailbox; the lwp id is copied to kse_thr_mailbox at context
switch time when the process is not being debugged, so when the process
is attached, the debugger can map kernel threads to user threads.

2. Add p_xthread to the proc structure and td_xsig to the thread structure.
p_xthread is used by a thread when it wants to report an event
to the debugger; every thread can set the pointer. In particular, when
it is used in ptracestop, the last thread reporting an event
wins the race. Every thread has a td_xsig to exchange a signal
with the debugger; a thread uses the TDF_XSIG flag to indicate it is
reporting a signal to the debugger, and if the flag is not cleared, the
thread will keep retrying until it is cleared by the debugger. p_xthread
may be used by the debugger to indicate the CURRENT thread. The p_xstat is
still in the proc structure to keep wait() working; in the future, we may
just use td_xsig.

3. Add the TDF_DBSUSPEND flag; it is used by the debugger to suspend
a thread. When the process stops, the debugger can set the flag for a
thread, and that thread will check the flag in thread_suspend_check and
enter a loop there unless the flag is cleared by the debugger, the process
is detached, or the process is exiting. The flag is also checked in
ptracestop, so the debugger can temporarily suspend a thread even
if the thread wants to exchange a signal.

4. Currently, in ptrace, we always resume all threads, but if a thread
has already got the TDF_DBSUSPEND flag set by the debugger, it won't run.

Encouraged by: marcel, julian, deischen


# 131473 02-Jul-2004 jhb

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to
pick one.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# 131149 26-Jun-2004 marcel

Allocate TIDs in thread_init() and deallocate them in thread_fini().
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.

This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
old and previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump its state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.

Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.


# 130877 21-Jun-2004 julian

Mark the thread in an exiting program as inactive.
This is not really used by the process but it's confusing to some
status readers to see zombie processes with "running" threads.

Pointed out by: Don Lewis <truckman@FreeBSD.org>


# 130735 19-Jun-2004 marcel

Define __lwpid_t as an int32_t in <sys/_types.h> and define lwpid_t
as an __lwpid_t in <sys/types.h>. Retype td_tid from an int to a
lwpid_t and change related definitions accordingly.


# 130674 18-Jun-2004 davidxu

If thread singler wants to terminate other threads, make sure it includes
all threads except itself.

Obtained from: julian


# 130355 11-Jun-2004 julian

Shuffle some code around.


# 130269 09-Jun-2004 jmallett

Add a comment explaining td_critnest's initial state and its life from that
point on, as it happens relatively indirectly, and in a codepath the casual
reader may not be acquainted with or find obvious.

Glanced at by: jhb


# 130199 07-Jun-2004 julian

Split kern_thread.c into 2 parts. kern_kse.c and kern_thread.c
Kern_kse has already been committed.
This separates out the KSE threading ABI from generic thread support.


# 129989 02-Jun-2004 tjr

Move TDF_SA from td_flags to td_pflags (and rename it accordingly)
so that it is no longer necessary to hold sched_lock while
manipulating it.

Reviewed by: davidxu


# 129547 21-May-2004 davidxu

Clear KSE thread flags after KSE thread mode is ended. The side effect
of not clearing the flags for the execv() syscall was that a new
program would run in KSE thread mode without enabling it.

Submitted by: tjr
Modified by: davidxu


# 128721 28-Apr-2004 deischen

Keep track of threads waiting in kse_release() to avoid a race
condition where kse_wakeup() doesn't yet see them in (interruptible)
sleep queues. Also add an upcall check to sleepqueue_catch_signals()
suggested by jhb.

This commit should fix recent mysql hangs.

Reviewed by: jhb, davidxu
Mysql'd by: Robin P. Blanchard <robin.blanchard at gactr uga edu>


# 127794 03-Apr-2004 marcel

Assign thread IDs to kernel threads. The purpose of the thread ID (tid)
is twofold:
1. When a 1:1 or M:N threaded process dumps core, we need to put the
register state of each of its kernel threads in the core file.
This can only be done by differentiating the pid field in the
respective note. For this we need the tid.
2. When thread support is present for remote debugging the kernel
with gdb(1), threads need to be identified by an integer due to
limitations in the remote protocol. This requires having a tid.

To minimize the impact of having thread IDs, threads that are created
as part of a fork (i.e. the initial thread in a process) will inherit
the process ID (i.e. tid=pid). Subsequent threads will have IDs larger
than PID_MAX to avoid interference with the pid allocation algorithm.
The assignment of tids is handled by thread_new_tid().

The thread ID allocation algorithm has been written with 3 assumptions
in mind:
1. IDs need to be created as fast as possible,
2. Reuse of IDs may happen instantaneously,
3. Someone else will write a better algorithm.


# 127264 21-Mar-2004 julian

Massively up the (artificial) limit on system scope threads
in a process from 50 to 500

Also up the number of process scope threads allowed to be in the kernel
at one time from 150 to 1500 (per process)


# 126932 13-Mar-2004 peter

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() is not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 126469 01-Mar-2004 jhb

Check for TDF_SINTR before calling sleepq_abort() as there is a narrow
race in between sleepq_add() and sleepq_catch_signals() in that setting
td_wchan and TDF_SINTR is not atomic to sched_lock but only to the sleepq
lock. This band-aid will stop assertion failures, but there is perhaps a
larger problem with the sleepq_add/sleepq_catch_signals race that I am not
sure how to solve. For the signals case the race is harmless because we
always call cursig() after setting TDF_SINTR. However, KSE doesn't do
anything in sleepq_catch_signals() to check that this race was lost, so I
am unsure if this race is harmful for this specific abort.


# 126326 27-Feb-2004 jhb

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep queues in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleeps no longer inherently bump the priority. Instead, this is solely
a property of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# 125158 28-Jan-2004 jhb

Use mtx_assert() rather than using a home-rolled version.


# 124944 25-Jan-2004 jeff

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and properly
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and
remove direct references to the process statistics.


# 124802 21-Jan-2004 rwatson

Reduce gratuitous includes: don't include jail.h if it's not needed.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.

Result of: self injury involving adding something to struct prison


# 124350 10-Jan-2004 schweikh

s/Muliple/Multiple
Removed whitespace at EOL and EOF.


# 123737 23-Dec-2003 peter

Don't use NULL (pointer) when we mean 0 (integer) for the number of ticks
in msleep.


# 123366 09-Dec-2003 marcel

Write the thread pointer (val) in the kse mailbox (loc) before we
set the new context in kse_switchin(2). This allows us to return
an error to the calling context when the suword() fails.


# 123252 07-Dec-2003 marcel

Add kse_switchin(2). This syscall can be used by KSE implementations
to have the kernel switch to a new thread, instead of doing it in
userland. It is in fact needed on ia64 where syscall restarts do not
return to userland first. It's completely handled inside the kernel.
As such, any context created by the kernel as part of an upcall and
caused by some syscall needs to be restored by the kernel.


# 123207 07-Dec-2003 alc

- Giant is no longer required by vm_thread_new().


# 122514 11-Nov-2003 jhb

Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking instead of implementing a thread queue
directly. These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool. Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads. The queue associated with a given lock
is found by a lookup in a simple hash table. The turnstile itself is
protected by a lock associated with its entry in the hash table. This
means that sched_lock is no longer needed to contest on a mutex. Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes. Eventually, however, turnstiles
may grow two queues to support a non-sleepable reader/writer lock
implementation. For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win). Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on: i386 SMP, sparc64 SMP, alpha SMP


# 119488 26-Aug-2003 davidxu

Let SA processes work under the ULE scheduler; originally they would panic
the kernel.

Reviewed by: jeff


# 119137 19-Aug-2003 sam

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


# 118893 14-Aug-2003 grehan

Update powerpc to use the (old thread,new thread) calling convention
for cpu_throw() and cpu_switch().


# 118835 12-Aug-2003 jhb

- Convert Alpha over to the new calling conventions for cpu_throw() and
cpu_switch() where both the old and new threads are passed in as
arguments. Only powerpc uses the old conventions now.
- Update comments in the Alpha swtch.s to reflect KSE changes.

Tested by: obrien, marcel


# 118673 08-Aug-2003 deischen

Copyin the thread mailbox flags from the correct location
in the mailbox.


# 118607 07-Aug-2003 jhb

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


# 118486 05-Aug-2003 davidxu

Introduce a thread mailbox flag TMF_NOUPCALL. On some architectures other
than i386 or AMD64, the TP register points to the thread mailbox, and they
cannot atomically clear km_curthread in the kse mailbox. In this case, the
thread retrieves its thread pointer from the TP register and sets the flag
TMF_NOUPCALL in its thread mailbox to indicate a critical region.


# 118442 04-Aug-2003 jhb

Set td_critnest to 1 when setting up a thread since it is a MI field with
MI values. This ensures that td_critnest for a newly fork'd thread is
always valid.

Requested by: bde (a long time ago)


# 117704 17-Jul-2003 davidxu

o Refine kse_thr_interrupt to allow it to handle different commands.
o Remove TDF_NOSIGPOST.
o Add a member td_waitset to proc structure, it will be used for sigwait.

Tested by: deischen


# 117637 15-Jul-2003 davidxu

If the initial thread is still a bound thread, don't change its signal mask.


# 117607 15-Jul-2003 davidxu

Rename thread_siginfo to cpu_thread_siginfo


# 117212 03-Jul-2003 mtm

kse_thr_interrupt should target the thread, specifically.

Requested by: davidxu


# 117205 03-Jul-2003 mtm

Signals sent specifically to a particular thread must
be delivered to that thread, regardless of whether it
has it masked or not.

Previously, if the targeted thread had the signal masked,
it would be put on the processes' siglist. If
another thread has the signal unmasked or unmasks it before
the target, then the thread it was intended for would never
receive it.

This patch attempts to solve the problem by requiring callers
of tdsignal() to say whether the signal is for the thread or
for the process. If it is for the process, then normal processing
occurs and any thread that has it unmasked can receive it.
But if it is destined for a specific thread, it is put on
that thread's pending list regardless of whether it is currently
masked or not.

The new behaviour still needs more work, though. If the signal
is reposted for some reason it is always posted back to the
thread that handled it because the information regarding the
target of the signal has been lost by then.

Reviewed by: jdp, jeff, bde (style)


# 117069 30-Jun-2003 davidxu

Fix typo.


# 117000 28-Jun-2003 marcel

Don't use fuword() and suword() on struct members of type int. This
happens to work on 32-bit platforms as sizeof(long)=sizeof(int), but
wreaks all kinds of havoc (garbage reads, corrupting writes and
misaligned loads/stores) on 64-bit architectures.
The fix for now is to use fuword32() and suword32() and change the
type of the applicable int fields to int32_t. This is to make it
explicit that we depend on these fields being 32-bit. We may want
to revisit this later.
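
A small hedged sketch of the resulting pattern (the function is invented
for illustration; note that fuword32()'s -1 return is ambiguous with a
legitimate stored -1, so real code may prefer a copyin()):

#include <sys/types.h>
#include <sys/systm.h>
#include <sys/errno.h>

static int
bump_user_int32(int32_t *uaddr)
{
    int32_t v;

    v = fuword32(uaddr);                /* 32-bit read; -1 on fault */
    if (v == -1)
        return (EFAULT);
    if (suword32(uaddr, v + 1) != 0)    /* 32-bit write; -1 on fault */
        return (EFAULT);
    return (0);
}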

Reviewed by: deischen


# 116963 28-Jun-2003 davidxu

o Change kse_thr_interrupt to allow sending a signal to a specified thread,
or unblocking a thread in the kernel, and allow the UTS to specify whether
a syscall should be restarted.
o Add the ability for the UTS to monitor signals coming in to and being
removed from the process; the flag PS_SIGEVENT is used to indicate the
events.
o Add a KMF_WAITSIGEVENT KSE mailbox flag; the UTS calls kse_release with
this flag set to wait for the above signal events.
o For SA-based threads, the kernel masks all signals in the thread's
signal mask, letting the UTS use kse_thr_interrupt to interrupt a thread
and install a signal frame in userland for the thread.
o Add a tm_syncsig field to the thread mailbox. When a hardware trap
occurs, it is used to deliver a synchronous signal to userland, and an
upcall is scheduled, so the UTS can process the synchronous signal for the
thread.

Reviewed by: julian (mentor)


# 116607 20-Jun-2003 davidxu

cpu_set_upcall_kse needs to access userspace, so release the scheduler lock
before calling it for a bound thread. To avoid this problem, change
thread_schedule_upcall to not put the new thread on the run queue; let the
caller do it, so we can tweak the new thread before setting it to run.

Reported by: pho


# 116452 16-Jun-2003 davidxu

Forgot to commit code to disable creating a bound thread in the same
group again, except for the first kse_create syscall.

Noticed by: julian


# 116440 16-Jun-2003 davidxu

Reset ncpus to 1 for a bound thread group since there is only one
thread in such a group.
Change the message text from kse_rel to kserel; it is better displayed
in top.


# 116401 15-Jun-2003 davidxu

1. Add code to support bound threads. When blocked, a bound thread never
schedules an upcall. Signal delivery to a bound thread is the same as for
a non-threaded process. This is intended to be used by libpthread to
implement PTHREAD_SCOPE_SYSTEM threads.
2. Simplify kse_release() a bit; remove the sleep loop.


# 116372 15-Jun-2003 davidxu

1. Migrate TDF_UPCALLING from td_flags to td_pflags.
2. Add a flag TDF_SA; it will be used to distinguish SA-based
threads from bound threads.


# 116361 14-Jun-2003 davidxu

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 116355 14-Jun-2003 alc

Migrate the thread stack management functions from the machine-dependent
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.

Two details:

1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard page, set
KSTACK_GUARD_PAGES to 0.

2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().


# 116184 10-Jun-2003 davidxu

Fix error in my last commit. Correctly maintain p_maxthrwaits and unlock
sched_lock.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 116138 10-Jun-2003 davidxu

If there are signals delivered to the current thread, break out of the
loop; userret() will be called again by ast() and thread_userret() will be
called again by userret().

Reported by: tegge


# 115884 06-Jun-2003 davidxu

thread_signal_add is now called with ps_mtx held; unlock it before
calling copyin.


# 115858 04-Jun-2003 marcel

Change the second (and last) argument of cpu_set_upcall(). Previously
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simple bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.

Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.

This change officially marks the ability on ia64 for 1:1 threading.

Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64


# 115790 03-Jun-2003 julian

Remove unneeded code.
Don't copyin() data we are about to overwrite.
Add a flag to tell userland that KSE is officially "DONE" with the
mailbox and has gone away.

Obtained from: davidxu@


# 115600 01-Jun-2003 marcel

Remove the ia64 hackery in threadinit() that was needed to work around
the lameness of the kstack code. The EPC overhaul de-lame-ified the
kstack code by removing the need for contigmalloc(). We can now
allocate stacks using malloc(). We probably want to make the stacks
swappable as well so that we can make it MI. But that's another story.


# 115549 31-May-2003 phk

Remove unused variable(s).

Found by: FlexeLint


# 115084 16-May-2003 marcel

Revamp of the syscall path, exception and context handling. The
prime objectives are:
o Implement a syscall path based on the epc instruction (see
sys/ia64/ia64/syscall.s).
o Revisit the places were we need to save and restore registers
and define those contexts in terms of the register sets (see
sys/ia64/include/_regset.h).

Secondary objectives:
o Remove the requirement to use contigmalloc for kernel stacks.
o Better handling of the high FP registers for SMP systems.
o Switch to the new cpu_switch() and cpu_throw() semantics.
o Add a good unwinder to reconstruct contexts for the rare
cases we need to (see sys/contrib/ia64/libuwx)

Many files are affected by this change. Functionally it boils
down to:
o The EPC syscall doesn't preserve registers it does not need
to preserve and places the arguments differently on the stack.
This affects libc and truss.
o The address of the kernel page directory (kptdir) had to
be unstaticized for use by the nested TLB fault handler.
The name has been changed to ia64_kptdir to avoid conflicts.
The renaming affects libkvm.
o The trapframe only contains the special registers and the
scratch registers. For syscalls using the EPC syscall path
no scratch registers are saved. This affects all places where
the trapframe is accessed. Most notably the unaligned access
handler, the signal delivery code and the debugger.
o Context switching only partly saves the special registers
and the preserved registers. This affects cpu_switch() and
triggered the move to the new semantics, which additionally
affects cpu_throw().
o The high FP registers are either in the PCB or on some
CPU. context switching for them is done lazily. This affects
trap().
o The mcontext has room for all registers, but not all of them
have to be defined in all cases. This mostly affects signal
delivery code now. The *context syscalls are as of yet still
unimplemented.

Many details went into the removal of the requirement to use
contigmalloc for kernel stacks. The details are mostly CPU
specific and limited to exception_save() and exception_restore().
The few places where we create, destroy or switch stacks were
mostly simplified by not having to construct physical addresses
and additionally saving the virtual addresses for later use.

Besides more efficient context saving and restoring, which of
course yields a noticeable speedup, this also fixes the dreaded
SMP bootup problem as a side-effect. The details of which are
still not fully understood.

This change includes all the necessary backward compatibility
code to have it handle older userland binaries that use the
break instruction for syscalls. Support for break-based syscalls
has been pessimized in favor of a clean implementation. Due to
the overall better performance of the kernel, this will still
be noticed as an improvement, if it's noticed at all.

Approved by: re@ (jhb)


# 114400 01-May-2003 davidxu

Fix a compile problem: p_tracee is in my local repository for
threaded process debugging and is not ready yet.


# 114398 01-May-2003 davidxu

Drop the Giant lock before being suspended and pick it up again
after being resumed. thread_suspend_check() is used in exit1(),
which still needs the Giant lock.


# 114336 30-Apr-2003 peter

AMD64 uses the new-style cpu_switch()/cpu_throw() calling conventions.


# 114268 29-Apr-2003 davidxu

Increase some default values.


# 114106 27-Apr-2003 davidxu

Unlock sched_lock at the right time.


# 113998 24-Apr-2003 deischen

Add an argument to get_mcontext() which specifies whether the
syscall return values should be cleared. The system calls
getcontext() and swapcontext() want to return 0 on success,
but these contexts can be switched to at a later time, so
the return values need to be cleared in the saved register
sets. Other callers of get_mcontext() would normally want
the context without clearing the return values.
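
A hedged sketch of the distinction, assuming a GET_MC_CLEAR_RET-style
flag constant (the name here is illustrative):

	/* getcontext()/swapcontext() path: make a later switch to this
	 * context appear to return 0 by clearing the return-value
	 * registers in the saved set. */
	get_mcontext(td, &ucp->uc_mcontext, GET_MC_CLEAR_RET);

	/* Other callers just want a faithful snapshot. */
	get_mcontext(td, &mc, 0);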

Remove the i386-specific context saving from the KSE code.
get_mcontext() is not i386-specific any more.

Fix a bad pointer in the alpha get_mcontext() code. The
context was being bcopy()'d from &td->tf_frame, but tf_frame
is itself a pointer, so the thread was being copied instead.
Spotted by jake.

Glanced at by: jake
Reviewed by: bde (months ago)


# 113920 23-Apr-2003 jhb

- Protect p_numthreads with the sched_lock.
- Protect p_singlethread with both the sched_lock and the proc lock.
- Protect p_suspcount with the proc lock.


# 113864 22-Apr-2003 jhb

- Mark the kse_purge_group() and kse_purge() definitions static to match
their prototypes.
- Remove sched_lock locking from kse_purge() as all callers already lock
the sched_lock before calling it.
- Hold the proc lock slightly longer to protect P_SHOULDSTOP().


# 113795 21-Apr-2003 davidxu

Fix lock order reversal problem.


# 113793 21-Apr-2003 davidxu

Introduce two flags to control upcall behaviour:

o KMF_NOUPCALL
Ask kse_release not to return to the userland upcall entry, but
instead to return directly to userland using the current thread's
stack and the return address already on it. This flag is intended
to be used by a UTS in a critical region to wait for another UTS
thread to leave the critical region; calling kse_release with this
flag avoids spinning and burning CPU. The flag can also be used by
the UTS to poll for completed contexts when there is nothing to do
in userland and it need not restart from its entry like a normal
upcall.

o KMF_NOCOMPLETED
Ask the kernel not to bring completed thread contexts back to
userland during the upcall. This flag is intended to be used with
the flag above when an upcall thread is in a critical region and
cannot process completed contexts at that time (see the sketch
below).
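
A hedged sketch of how a UTS critical section might combine the two
flags; the km_flags mailbox field and the timeout argument are
assumptions based on the description above:

	/* Wait for the peer to leave the critical region without
	 * spinning; don't restart at the upcall entry and don't pull
	 * completed contexts back yet. */
	kmbx->km_flags |= KMF_NOUPCALL | KMF_NOCOMPLETED;
	kse_release(&timeout);
	kmbx->km_flags &= ~(KMF_NOUPCALL | KMF_NOCOMPLETED);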

Tested by: deischen


# 113708 19-Apr-2003 davidxu

Test next upcall time correctly.


# 113705 19-Apr-2003 davidxu

Use correct thread pointer.


# 113686 18-Apr-2003 jhb

Use the proc lock to protect p_singlethread and a P_WEXIT test. This
fixes a couple of potential KSE panics on non-i386 arches that weren't
holding the proc lock when calling thread_exit().
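
A hedged sketch of the pattern the fix enforces; the exact lock
requirements of thread_exit() shown here are assumptions drawn from
the message:

	PROC_LOCK(p);
	/* p_singlethread and P_WEXIT are only stable with the proc
	 * lock held. */
	if ((p->p_flag & P_WEXIT) != 0) {
		mtx_lock_spin(&sched_lock);
		thread_exit();		/* does not return */
	}
	PROC_UNLOCK(p);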


# 113641 17-Apr-2003 julian

Add a thread_unlink() and use it.
It could also be used twice in kern_thr.c but that's owned by jeff,
so I'll let him change it when he's next there.


# 113626 17-Apr-2003 jhb

Protect td_sigmask with the proc lock.


# 113339 10-Apr-2003 julian

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but its purpose will change a bit
when we add the ability to lock a KSE to a CPU.


# 113244 08-Apr-2003 davidxu

Inherit blocked thread's context for upcall thread.


# 112993 02-Apr-2003 peter

Commit a partial lazy thread switch mechanism for i386. It isn't as lazy
as it could be and could do with some more cleanup. Currently it's under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process, i.e. we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the TLB. However, this isn't as exciting as it could
be; the interrupt overhead is still high and too much still blocks on
Giant. There are some debug sysctls, for stats and for an on/off switch.

The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.

It's not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.

This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.

The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.

One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process, which happens
uncomfortably often. It also simplifies a bit of the asm code in
cpu_switch (no longer having to call choosethread() in the middle).
This has been implemented on i386 and (thanks to jake) sparc64. The
others will come soon. This is actually separate from the lazy switch
stuff.
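
A hedged sketch of the "select before switching" idea; the calling
convention shown is illustrative:

	/* Pick the successor in C so the silly case of switching to
	 * ourselves can be caught before entering the asm. */
	newtd = choosethread();
	if (newtd != td)
		cpu_switch(td, newtd);	/* asm no longer picks a thread */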

Glanced at by: jake, jhb


# 112910 31-Mar-2003 jeff

- Borrow the KSE single threading code for exec and exit. We use the check
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads (see the sketch below). The rest of the system
has no need to identify thr-threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.
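
A hedged sketch of the exit-path check from the first item; the
single-threading call and panic message are assumptions based on the
description:

	/* Only processes that actually have extra threads pay for the
	 * single-threading dance; no flag is consulted. */
	PROC_LOCK(p);
	if (p->p_numthreads > 1) {
		if (thread_single(SINGLE_EXIT))
			panic("exit: single threading fouled up");
	}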


# 112888 31-Mar-2003 jeff

- Move p->p_sigmask to td->td_sigmask. Signal masks will be per-thread with
a follow-on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.


# 112750 28-Mar-2003 jhb

Check for the PS_NEEDSIGCHK flag in the right flags field.


# 112397 19-Mar-2003 davidxu

Adjust code for userland preemption. Userland can set a quantum in the
kse_mailbox to schedule an upcall; this is useful for userland timeout
routines such as pthread_cond_timedwait().

Also extract the upcall scheduling code from kse_reassign and create
a new function called thread_switchout to hold it.

Reviewed by: julian


# 112222 14-Mar-2003 davidxu

Export the current time when returning from a syscall that never blocked.


# 112079 11-Mar-2003 davidxu

This is a force-commit for:
kern_sig.c 1.215
kern_thread.c 1.103
kern_exit.c 1.199
proc.h 1.302

The original code would suspend an already-suspended thread if the
user pressed ^Z while a threaded program was running. There was also
a race between job control and thread_exit(); the new code tests for
a job control request before the thread exits. In the wait() syscall,
be sure to check that the child process is fully stopped; this avoids
a later SIGCHLD and returning STOPPED status twice for a threaded
child process. A thread_stopped() function is added for the code
common to several places.


# 112078 11-Mar-2003 davidxu

Lock proc lock before changing p_flag.


# 112077 11-Mar-2003 davidxu

Fix a signal delivery bug for threaded processes.


# 112071 10-Mar-2003 davidxu

Fix threaded process job control bug. SMP tested.

Reviewed by: julian


# 111976 08-Mar-2003 davidxu

Lock sched_lock before modifying td_flags.


# 111677 28-Feb-2003 davidxu

Check kse group limit before linking new ksegrp.


# 111595 27-Feb-2003 davidxu

Release sched_lock before calling upcall_free.


# 111585 27-Feb-2003 julian

Change the process flag P_KSES to P_THREADED.
This is just a cosmetic change, but I've been meaning to do it for about a year.


# 111523 25-Feb-2003 davidxu

Add a missing '!'.


# 111515 25-Feb-2003 davidxu

Add a simple facility to allow round robin in userland.

Reviewed by: julian


# 111465 25-Feb-2003 davidxu

Remove a bogus comment.


# 111394 23-Feb-2003 davidxu

Remove an XXXKSE: kg_completed now needs the proc lock.


# 111387 23-Feb-2003 davidxu

Back out the last surplus commit. That day just wasn't my day.


# 111207 21-Feb-2003 davidxu

If the UTS is calling kse_wakeup for itself, do nothing.


# 111170 20-Feb-2003 davidxu

Forgot to set KU_DOUPCALL in kse_wakeup.


# 111169 20-Feb-2003 davidxu

Add a timeout parameter to kse_release.


# 111154 19-Feb-2003 davidxu

Move the thread limit testing code up a bit. This lets an UPCALLING
thread take away any accumulated contexts.


# 111129 19-Feb-2003 davidxu

Count non-threaded groups.


# 111125 19-Feb-2003 davidxu

Use M_WAITOK and remove a useless comment.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 111115 19-Feb-2003 davidxu

Optimize the case where the maximum number of threads is hit.


# 111042 17-Feb-2003 davidxu

Further fix PS_NEEDSIGCHK


# 111041 17-Feb-2003 davidxu

Move code for detecting PS_NEEDSIGCHK into thread_schedule_upcall,
I think it is a better place to handle it.


# 111033 17-Feb-2003 jeff

- Add a new function, thread_signal_add(), that is called from postsig to
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.

Reviewed by: mini


# 111032 17-Feb-2003 julian

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listened to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much of the
KSE code.

Submitted by: davidxu


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 110161 31-Jan-2003 julian

Only add one tick per tick to the thread stats, instead of some random number.


# 109909 26-Jan-2003 davidxu

Use kg_numupcalls to see if we are closing a thread group,
not kg_kses, which does not change while a group is still working.


# 109877 26-Jan-2003 davidxu

Move the UPCALL-related data structures out of the kse and introduce
a new data structure called kse_upcall to manage UPCALLs. All KSE
binding and loaning code is gone.

A thread that owns an upcall can collect all completed syscall
contexts in its ksegrp, turn itself into UPCALL mode, and take those
contexts back to userland. Any thread without an upcall structure has
to export its context and exit at the user boundary.

Any thread running in user mode owns an upcall structure. When it
enters the kernel and the kse mailbox's current thread pointer is not
NULL, then if the thread blocks in the kernel, a new UPCALL thread is
created and the upcall structure is transferred to it. If the kse
mailbox's current thread pointer is NULL, no UPCALL thread is created
when the thread blocks in the kernel.

Each upcall always has an owner thread. Userland can remove an upcall
by calling kse_exit; when all upcalls in a ksegrp are removed, the
group is automatically shut down. An upcall owner thread also exits
when the process is in the exiting state; when an owner thread exits,
the upcall it owns is removed as well.

A KSE is a pure scheduler entity; it represents a virtual CPU. When a
thread is running, it always has a KSE associated with it. The
scheduler is free to assign a KSE to a thread according to thread
priority; if a thread's priority changes, the KSE can be moved from
one thread to another.

When a ksegrp is created, N KSEs are created in the group, where N is
the number of physical CPUs in the current system. This makes it
possible for threads in the kernel to execute on different CPUs in
parallel even if the userland UTS is only single-CPU safe. Userland
calls kse_create to add more upcall structures to the ksegrp to
increase concurrency in userland itself; the kernel is not restricted
by the number of upcalls userland provides.

The code hasn't been tested under SMP by the author due to lack of
hardware.
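
A hedged sketch of the ownership relationships described above; the
field names are illustrative, not the actual FreeBSD definitions:

	/* One upcall per userland-visible "virtual processor". */
	struct kse_upcall {
		TAILQ_ENTRY(kse_upcall)	 ku_link;	/* on ksegrp's list */
		struct ksegrp		*ku_ksegrp;	/* owning group */
		struct thread		*ku_owner;	/* owner thread */
		struct kse_mailbox	*ku_mailbox;	/* userland mailbox */
		int			 ku_flags;	/* upcall pending etc. */
	};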

Reviewed by: julian


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109550 20-Jan-2003 julian

Remove a KASSERT that can now happen and add a missing setrunnable.


# 108862 07-Jan-2003 davidxu

Check signals for idled threads.


# 108640 04-Jan-2003 davidxu

Set kse mailbox pointer to NULL when P_KSES is turned off.


# 108614 03-Jan-2003 julian

White space fixes


# 108613 03-Jan-2003 julian

Make an explicit flag to indicate that a KSE has a reason to upcall,
and use that flag when there is a kse_wakeup() call. It will probably
be used with signal delivery as well eventually.

Submitted by: davidxu@


# 108611 03-Jan-2003 julian

Don't need to set retvals to 0 in the non-error case. They
are set to a good default anyhow.

Submitted by: davidxu@


# 108544 02-Jan-2003 davidxu

Adjust code for Julian's last commit. Use td_mailbox to detect whether
a syscall is from the UTS kernel.


# 108338 27-Dec-2002 julian

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads share KSEs. Every KSE now has a thread, which is
considered its "owner"; however, a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
In this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creation of upcalls etc. is now done from
kse_reassign(), which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more, but
converts the existing thread into the KSE's owner and sets it up
for doing an upcall. It is just inhibited from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no longer needed.
"Idle" KSEs are now on the loanable queue.


# 107719 10-Dec-2002 julian

Unbreak the KSE code. Keep track of zombie threads using the per-CPU
storage during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 107180 22-Nov-2002 mux

Under certain circumstances, we were calling kmem_free() from
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
holding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().

Approved by: re@
Suggested by: jhb


# 107126 20-Nov-2002 jeff

- Implement a mechanism for allowing schedulers to place scheduler-dependent
data in the scheduler-independent structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.

Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha


# 107060 18-Nov-2002 davidxu

Make sure the wall clock is only updated at upcall time; slightly
reformat the code in kse_release().


# 107034 17-Nov-2002 davidxu

1. Support versioning and wall clock in the kse mailbox,
and add rusage time to the thread mailbox.
2. Minor change to the thread limit code in thread_user_enter();
fix a typo in kse_release() from my last commit.

Reviewed by: deischen, mini


# 107029 17-Nov-2002 julian

Include smp.h.
It is required by some code that was commented out until David's
last commit.


# 107006 17-Nov-2002 davidxu

1. Add sysctls to control KSE resource allocation:
kern.threads.max_threads_per_proc
kern.threads.max_groups_per_proc
2. Temporarily disable a borrower thread stashing itself as the
owner thread's spare thread in thread_exit(). There is a race
between the owner thread and the borrower thread: an owner thread
may allocate a spare thread like this:
if (td->td_standin == NULL)
td->td_standin = thread_alloc();
but thread_alloc() can block the thread, and a borrower thread
could then stash itself as the owner's spare thread in
thread_exit(). After the owner is resumed, the result is a thread
leak in the kernel. A double check in the owner can avoid the race
(see the sketch below), but it may be ugly and not worth doing.
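
A hedged sketch of the "double check in owner" idea dismissed above,
purely for illustration:

	if (td->td_standin == NULL) {
		td2 = thread_alloc();		/* may sleep */
		if (td->td_standin == NULL)	/* re-check after sleeping */
			td->td_standin = td2;
		else
			thread_free(td2);	/* borrower beat us to it */
	}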


# 107003 17-Nov-2002 davidxu

Rework the last exiting thread in kse_release(): wait for a signal,
then schedule an upcall and call thread_exit().


# 106944 14-Nov-2002 davidxu

Return EWOULDBLOCK for last thread in kse_release().

Requested by: archie


# 106903 14-Nov-2002 davidxu

In kse_release(), check whether the current thread is bound
and the current kse mailbox was already initialized; also
prevent the last thread from exiting until we figure out
how to safely support a proc with no threads.


# 106242 31-Oct-2002 davidxu

KSE-enabled processes only.


# 106188 30-Oct-2002 davidxu

Check NULL thread mailbox pointer.


# 106182 30-Oct-2002 davidxu

Style fixes.


# 106181 30-Oct-2002 davidxu

Don't forget to set syscall result.


# 106180 30-Oct-2002 davidxu

Add an actual implementation of kse_thr_interrupt()


# 106075 28-Oct-2002 davidxu

Close a race window in kse_create(): a signal delivered after the
SIGPENDING call but before we call kse_link().


# 105974 26-Oct-2002 julian

Back out David's last commit. The suspension code needs to be called
for non-KSE processes too.


# 105972 26-Oct-2002 davidxu

Move suspension checking code from userret() into thread_userret().


# 105970 25-Oct-2002 davidxu

Back out revision 1.48.


# 105931 25-Oct-2002 davidxu

Suspend a thread only when it can be interrupted.


# 105930 25-Oct-2002 davidxu

Let thread_schedule_upcall() handle an idle KSE.


# 105912 25-Oct-2002 julian

fix style-o


# 105911 25-Oct-2002 julian

More work on the interaction between suspending and sleeping threads.
Also clean up some code used with 'single-threading'.

Reviewed by: davidxu


# 105901 24-Oct-2002 davidxu

fix typo.


# 105900 24-Oct-2002 julian

Extract KSE-specific code from machine-specific code
so that there is only one copy of it. Fix that one copy
so that KSEs with no mailbox in a KSE program are not a cause
of page faults (this can legitimately happen).

Submitted by: (parts) davidxu


# 105874 24-Oct-2002 davidxu

Respect TDF_SINTR; also, in SINGLE_NO_EXIT threading mode, if a thread
was already suspended, do nothing.


# 105855 24-Oct-2002 davidxu

Don't forget to remove the KSE from the idle queue.


# 105854 24-Oct-2002 julian

Move thread related code from kern_proc.c to kern_thread.c.
Add code to free KSEs and KSEGRPs on exit.
Sort KSE prototypes in proc.h.
Add the missing kse_exit() syscall.

ksetest now does not leak KSEs and KSEGRPS.

Submitted by: (parts) davidxu


# 104695 09-Oct-2002 julian

Round out the facility for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
the owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is conceivable that the
borrower should inherit the priority of the owner too;
that's another discussion and would be simple to do.

Also, as part of this, the "preallocated spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot more info for threaded processes, though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 104503 05-Oct-2002 jmallett

Put an easy-to-miss assignment into the proper place. It was stray in the
middle of a block of code, with no clear association. While here, move one
nearby assignment out of a declaration.


# 104502 05-Oct-2002 jmallett

Remove bogus duplicate assignment of local variables.


# 104437 03-Oct-2002 peter

Add some unspeakable hackery to the tree under #ifdef __ia64__ to work
around limitations in the ia64 kernel stack handling code. Basically
preallocate a bunch of threads (and hence kstacks) while contigmalloc()
still works, and never free them back to the general memory pool. After
the system has been running for a while, contigmalloc() eventually fails
at a critical moment and panics the system.


# 104354 02-Oct-2002 scottl

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
whose size can be specified when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.
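
A hedged sketch of a caller asking for a bigger kstack; the position
of the page-count argument and the ACPI function and variable names
are illustrative:

	/* Request a 4-page kernel stack; passing 0 would keep the
	 * default KSTACK_PAGES allocation. */
	error = kthread_create(acpi_task_loop, sc, &p,
	    RFHIGHPID, 4, "acpi_task");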

Reviewed by: jake, peter, jhb


# 104157 29-Sep-2002 julian

Implement basic KSE loaning. This stops a thread that is blocked in BOUND mode
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)

Initial concept by: julian, dillon
Submitted by: davidxu


# 104126 29-Sep-2002 julian

lock proc while calling psignal
(plus related cleanups)

Submitted by: davidxu


# 104031 27-Sep-2002 julian

Redo how completing threads pass their state to userland
if they are not going to cross over themselves. Also change how the list of
completed user threads is tracked and passed to the KSE. This is not
a change in design but rather the implementation of what was originally
envisioned.


# 103972 25-Sep-2002 archie

Make the following name changes to KSE related functions, etc., to better
represent their purpose and minimize namespace conflicts:

kse_fn_t -> kse_func_t
struct thread_mailbox -> struct kse_thr_mailbox
thread_interrupt() -> kse_thr_interrupt()
kse_yield() -> kse_release()
kse_new() -> kse_create()

Add missing declaration of kse_thr_interrupt() to <sys/kse.h>.
Regenerate the various generated syscall files. Minor style fixes.

Reviewed by: julian


# 103859 23-Sep-2002 julian

Don't use local variable 'p' in a debug statement.. we removed it.


# 103838 23-Sep-2002 julian

Slightly clean up the thread_userret() and thread_consider_upcall() calls,
plus some slight changes to TDF_BOUND testing and small style changes.
This should only affect KSE programs.

Submitted by: davidxu


# 103464 17-Sep-2002 peter

Argh. I've been reading makefiles for too long. Change comment to a
C-style comment.


# 103463 17-Sep-2002 peter

Stub out the calls to get_mcontext and set_mcontext which only exist on
i386. This stuff should not be prototyped in MD includes if the interface
is expected to be MI.


# 103410 16-Sep-2002 mini

Add kernel support needed for the KSE-aware libpthread:
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.

Reviewed by: bde, deischen, julian
Approved by: -arch


# 103367 15-Sep-2002 julian

Allocate KSEs and KSEGRPs separately and remove them from the proc structure.
The next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 103312 14-Sep-2002 julian

Apparently something down in the guts of vm/uvm still needs Giant.

Obtained from: mini via P4 KSE tree.


# 103216 11-Sep-2002 julian

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 103072 07-Sep-2002 julian

Fix braino..
was clearing part of the wrong thread structure..


# 103055 06-Sep-2002 julian

Fix misplaced sched_lock.

Submitted by: davidxu@freebsd.org


# 103002 06-Sep-2002 julian

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures*.

I.e. it gets the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non-threaded program (non-KSE, that is) the allocated
thread and its stack remain attached to the process, even when the
process is unused and in the process cache. This saves having to allocate
and attach them later, effectively bringing us (hopefully) close to the
efficiency of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


# 102950 05-Sep-2002 davidxu

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix the abbreviations for the P_STOPPED_* etc. flags; in the original
code they were inconsistent and difficult to distinguish.

Approved by: julian (mentor)


# 102898 03-Sep-2002 davidxu

In the kernel code, we have the tsleep() call with the PCATCH argument.
PCATCH means "if we get a signal, interrupt me!" and tsleep returns
either EINTR or ERESTART depending on the circumstances. ERESTART is
"special" because it causes the system call to fail, but right as it
returns back to userland it tells the trap handler to move %eip back a
bit so that userland will immediately re-run the syscall.
This is a syscall restart. It only works for things like read() etc where
nothing has changed yet. Note that *userland* is tricked into restarting
the syscall by the kernel. The kernel doesn't actually do the restart. It
is deadly for things like select, poll, nanosleep etc where it might cause
the elapsed time to be reset and start again from scratch. So those
syscalls do this to prevent userland rerunning the syscall:
if (error == ERESTART) error = EINTR;

Fake "signals" like SIGTSTP from ^Z etc do not normally invoke userland
signal handlers. But, in -current, the PCATCH *is* being triggered and
tsleep is returning ERESTART, and the syscall is aborted even though no
userland signal handler was run.
That is the fault here. We're triggering the PCATCH in cases that we
shouldn't, i.e. it is being triggered on *any* signal processing, rather
than only the case where the signal is posted to userland.
--- Peter

The work of psignal() is a patchwork of special cases required by the process
debugging and job-control facilities...
--- Kirk McKusick
"The Design and Implementation of the 4.4BSD Operating System"
Page 105

In the STABLE source, when psignal posts a STOP signal to a sleeping
process whose signal action is SIG_DFL, the system directly changes the
process state from SSLEEP to SSTOP. When SIGCONT is then posted to the
stopped process and the process is found to still be on the sleep queue,
the process state is restored to SSLEEP without waking the process up.

This commit mimics that behaviour from the STABLE source tree.
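
A hedged sketch of the restart-suppression idiom quoted above, as a
time-sensitive syscall might apply it (the wait channel and wmesg are
illustrative):

	error = tsleep(&nanowait, PWAIT | PCATCH, "nanslp", timo);
	if (error == ERESTART)
		error = EINTR;	/* don't let userland re-run the sleep */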

Reviewed by: Jon Mini, Tim Robbins, Peter Wemm
Approved by: julian@freebsd.org (mentor)


# 102581 29-Aug-2002 julian

Fix crack-smoking code that was panicking on the quad Xeon:
- If either proc or kse is NULL during thread_exit(), the
kernel is going to fault because parts of the function
assume they aren't NULL. Instead, just assert that they aren't NULL
(as well as the kse group) and assume they are in all of the
code. It doesn't make sense for them to be NULL here anyway.
- Move the PROC_UNLOCK(p) up above the clearing of td_proc, etc.,
since otherwise we will panic if the proc's lock is contested (see
the sketch below).
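
A hedged sketch of both fixes in thread_exit(); the assertion strings
and exact ordering are illustrative:

	KASSERT(p != NULL, ("thread_exit: NULL proc"));
	KASSERT(ke != NULL, ("thread_exit: NULL kse"));
	KASSERT(kg != NULL, ("thread_exit: NULL kse group"));
	/* Unlock while the owner is still intact; releasing a
	 * contested lock touches the owning thread's state. */
	PROC_UNLOCK(p);
	td->td_proc = NULL;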

Submitted by: jhb@freebsd.org


# 102292 22-Aug-2002 julian

Slight cleanup of the single-threading code for KSE processes.


# 102238 21-Aug-2002 julian

Revert some suspension/sleep/signal code from KSE-III
We need to rethink a bit of this and it doesn't matter if
we break the KSE test program for now as long
as non-KSE programs act as expected.

Submitted by: David Xu <bsddiy@yahoo.com>
(this guy's just asking to get hit with a commit bit..)


# 101177 01-Aug-2002 julian

Fix a comment.


# 100653 25-Jul-2002 julian

Get suspension counting right.
Fix an error message.

Submitted by: David Xu <bsddiy@yahoo.com>


# 100648 24-Jul-2002 julian

Fix some style problems and remove a mis-merged assert.


# 100646 24-Jul-2002 julian

Add some locking asserts and some comments


# 100632 24-Jul-2002 julian

When single threading a multithreaded program, awaken the
'single threading thread' when the last other thread suspends.
I had this code in there before but it seems to have been
accidentally deleted somewhere along the way. This would only affect
multithreaded processes.

Reviewed by: David Xu <bsddiy@yahoo.com>


# 100594 24-Jul-2002 julian

When suspending a thread, update the appropriate (sic) statistic.


# 100273 17-Jul-2002 peter

ia64 does not have the same degree of stealth include file nesting,
so it needs an explicit #include <machine/frame.h> to get 'struct
trapframe'. The fact that it needs this at this level is rather bogus
but it will not compile without it.


# 100271 17-Jul-2002 peter

Pacify gcc on ia64


# 99942 14-Jul-2002 julian

Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting KSE state can go away,
as the KSE just moves around with whatever state the thread is in.
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by: jhb@freebsd.org


# 99559 07-Jul-2002 peter

Collect all the (now equivalent) pmap_new_proc/pmap_dispose_proc/
pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code
and use a single equivalent MI version. There are other cleanups
needed still.

While here, use the UMA zone hooks to keep a cache of preinitialized
proc structures handy, just like the thread system does. This eliminates
one dependency on 'struct proc' being persistent even after being freed.
There are some comments about things that can be factored out into
ctor/dtor functions if it is worth it. For now they are mostly just
doing statistics to get a feel of how it is working.


# 99552 07-Jul-2002 peter

Make this compile on 64 bit platforms


# 99198 01-Jul-2002 arr

- In thread_userret(), remove the Giant locking and unlocking around the
call to thread_alloc().

Approved by: julian
Reviewed by: jake, jeff


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
To come: ia64 and PowerPC patches, patches for gdb, a test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 99026 29-Jun-2002 julian

Add files that are new for KSE.