Cross Reference: /freebsd-11-stable/sys/kern/kern

History log of /freebsd-11-stable/sys/kern/kern_synch.c
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
# 359652	06-Apr-2020	hselasky	MFC r333806: Use NULL for SYSINIT's last arg, which is a pointer type Sponsored by: The FreeBSD Foundation
# 354405	06-Nov-2019	mav	MFC r349220: Add wakeup_any(), cheaper wakeup_one() for taskqueue(9). wakeup_one() and underlying sleepq_signal() spend additional time trying to be fair, waking thread with highest priority, sleeping longest time. But in case of taskqueue there are many absolutely identical threads, and any fairness between them is quite pointless. It makes even worse, since round-robin wakeups not only make previous CPU affinity in scheduler quite useless, but also hide from user chance to see CPU bottlenecks, when sequential workload with one request at a time looks evenly distributed between multiple threads. This change adds new SLEEPQ_UNFAIR flag to sleepq_signal(), making it wakeup thread that went to sleep last, but no longer in context switch (to avoid immediate spinning on the thread lock). On top of that new wakeup_any() function is added, equivalent to wakeup_one(), but setting the flag. On top of that taskqueue(9) is switchied to wakeup_any() to wakeup its threads. As result, on 72-core Xeon v4 machine sequential ZFS write to 12 ZVOLs with 16KB block size spend 34% less time in wakeup_any() and descendants then it was spending in wakeup_one(), and total write throughput increased by ~10% with the same as before CPU usage.
# 331722	29-Mar-2018	eadler	Revert r330897: This was intended to be a non-functional change. It wasn't. The commit message was thus wrong. In addition it broke arm, and merged crypto related code. Revert with prejudice. This revert skips files touched in r316370 since that commit was since MFCed. This revert also skips files that require $FreeBSD$ property changes. Thank you to those who helped me get out of this mess including but not limited to gonzo, kevans, rgrimes. Requested by: gjb (re)
# 330897	14-Mar-2018	eadler	Partial merge of the SPDX changes These changes are incomplete but are making it difficult to determine what other changes can/should be merged. No objections from: pfg
# 330852	13-Mar-2018	hselasky	MFC r330344: Correct the return code from pause() during cold startup from zero to EWOULDBLOCK. This also matches the description in pause(9). Discussed with: kib@ Sponsored by: Mellanox Technologies
# 330850	13-Mar-2018	hselasky	MFC r330349 and r330362: Allow pause_sbt() to catch signals during sleep by passing C_CATCH flag. Define pause_sig() function macro helper similarly to other kernel functions which catch signals. Update outdated function description. Document pause_sig(9) and update prototypes for existing pause(9) and pause_sbt(9) functions. Discussed with: kib@ Suggested by: cem@ Sponsored by: Mellanox Technologies
# 316842	14-Apr-2017	avg	MFC r315960: dtrace sched:::preempt should fire only when there is preemption
# 316840	14-Apr-2017	avg	MFC r315851: move thread switch tracing from mi_switch to sched_switch
# 315386	16-Mar-2017	mjg	MFC r313853,r313859: locks: remove SCHEDULER_STOPPED checks from primitives for modules They all fallback to the slow path if necessary and the check is there. This means a panicked kernel executing code from modules will be able to succeed doing actual lock/unlock, but this was already the case for core code which has said primitives inlined. == Introduce SCHEDULER_STOPPED_TD for use when the thread pointer was already read Sprinkle in few places.
# 310981	31-Dec-2016	jhb	MFC 310336: Don't spin in pause() during early boot for kthreads other than thread0. pause() uses a spin loop to simulate a sleep during early boot. However, we only need this for thread0 to get far enough in the boot process to enable timers (at which point pause() can sleep). For other kthreads, sleeping in pause() is ok as the callout will be scheduled and will eventually fire once thread0 initializes timers.
# 308468	09-Nov-2016	kib	MFC r308228: Remove remnants of the recursive sleep support.
# 302408	07-Jul-2016	gjb	Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, as nothing has been merged here. Additional commits post-branch will follow. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-11-stable/MAINTAINERS /freebsd-11-stable/cddl /freebsd-11-stable/cddl/contrib/opensolaris /freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print /freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs /freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs /freebsd-11-stable/contrib/amd /freebsd-11-stable/contrib/apr /freebsd-11-stable/contrib/apr-util /freebsd-11-stable/contrib/atf /freebsd-11-stable/contrib/binutils /freebsd-11-stable/contrib/bmake /freebsd-11-stable/contrib/byacc /freebsd-11-stable/contrib/bzip2 /freebsd-11-stable/contrib/com_err /freebsd-11-stable/contrib/compiler-rt /freebsd-11-stable/contrib/dialog /freebsd-11-stable/contrib/dma /freebsd-11-stable/contrib/dtc /freebsd-11-stable/contrib/ee /freebsd-11-stable/contrib/elftoolchain /freebsd-11-stable/contrib/elftoolchain/ar /freebsd-11-stable/contrib/elftoolchain/brandelf /freebsd-11-stable/contrib/elftoolchain/elfdump /freebsd-11-stable/contrib/expat /freebsd-11-stable/contrib/file /freebsd-11-stable/contrib/gcc /freebsd-11-stable/contrib/gcclibs/libgomp /freebsd-11-stable/contrib/gdb /freebsd-11-stable/contrib/gdtoa /freebsd-11-stable/contrib/groff /freebsd-11-stable/contrib/ipfilter /freebsd-11-stable/contrib/ldns /freebsd-11-stable/contrib/ldns-host /freebsd-11-stable/contrib/less /freebsd-11-stable/contrib/libarchive /freebsd-11-stable/contrib/libarchive/cpio /freebsd-11-stable/contrib/libarchive/libarchive /freebsd-11-stable/contrib/libarchive/libarchive_fe /freebsd-11-stable/contrib/libarchive/tar /freebsd-11-stable/contrib/libc++ /freebsd-11-stable/contrib/libc-vis /freebsd-11-stable/contrib/libcxxrt /freebsd-11-stable/contrib/libexecinfo /freebsd-11-stable/contrib/libpcap /freebsd-11-stable/contrib/libstdc++ /freebsd-11-stable/contrib/libucl /freebsd-11-stable/contrib/libxo /freebsd-11-stable/contrib/llvm /freebsd-11-stable/contrib/llvm/projects/libunwind /freebsd-11-stable/contrib/llvm/tools/clang /freebsd-11-stable/contrib/llvm/tools/lldb /freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump /freebsd-11-stable/contrib/llvm/tools/llvm-lto /freebsd-11-stable/contrib/mdocml /freebsd-11-stable/contrib/mtree /freebsd-11-stable/contrib/ncurses /freebsd-11-stable/contrib/netcat /freebsd-11-stable/contrib/ntp /freebsd-11-stable/contrib/nvi /freebsd-11-stable/contrib/one-true-awk /freebsd-11-stable/contrib/openbsm /freebsd-11-stable/contrib/openpam /freebsd-11-stable/contrib/openresolv /freebsd-11-stable/contrib/pf /freebsd-11-stable/contrib/sendmail /freebsd-11-stable/contrib/serf /freebsd-11-stable/contrib/sqlite3 /freebsd-11-stable/contrib/subversion /freebsd-11-stable/contrib/tcpdump /freebsd-11-stable/contrib/tcsh /freebsd-11-stable/contrib/tnftp /freebsd-11-stable/contrib/top /freebsd-11-stable/contrib/top/install-sh /freebsd-11-stable/contrib/tzcode/stdtime /freebsd-11-stable/contrib/tzcode/zic /freebsd-11-stable/contrib/tzdata /freebsd-11-stable/contrib/unbound /freebsd-11-stable/contrib/vis /freebsd-11-stable/contrib/wpa /freebsd-11-stable/contrib/xz /freebsd-11-stable/crypto/heimdal /freebsd-11-stable/crypto/openssh /freebsd-11-stable/crypto/openssl /freebsd-11-stable/gnu/lib /freebsd-11-stable/gnu/usr.bin/binutils /freebsd-11-stable/gnu/usr.bin/cc/cc_tools /freebsd-11-stable/gnu/usr.bin/gdb /freebsd-11-stable/lib/libc/locale/ascii.c /freebsd-11-stable/sys/cddl/contrib/opensolaris /freebsd-11-stable/sys/contrib/dev/acpica /freebsd-11-stable/sys/contrib/ipfilter /freebsd-11-stable/sys/contrib/libfdt /freebsd-11-stable/sys/contrib/octeon-sdk /freebsd-11-stable/sys/contrib/x86emu /freebsd-11-stable/sys/contrib/xz-embedded /freebsd-11-stable/usr.sbin/bhyve/atkbdc.h /freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c /freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h /freebsd-11-stable/usr.sbin/bhyve/console.c /freebsd-11-stable/usr.sbin/bhyve/console.h /freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c /freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c /freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h /freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c /freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h /freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c /freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h /freebsd-11-stable/usr.sbin/bhyve/rfb.c /freebsd-11-stable/usr.sbin/bhyve/rfb.h /freebsd-11-stable/usr.sbin/bhyve/sockstream.c /freebsd-11-stable/usr.sbin/bhyve/sockstream.h /freebsd-11-stable/usr.sbin/bhyve/usb_emul.c /freebsd-11-stable/usr.sbin/bhyve/usb_emul.h /freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c /freebsd-11-stable/usr.sbin/bhyve/vga.c /freebsd-11-stable/usr.sbin/bhyve/vga.h
# 301456	05-Jun-2016	kib	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711
# 300489	23-May-2016	hselasky	Use DELAY() instead of _sleep() when SCHEDULER_STOPPED() is set inside pause_sbt(). This allows pause() to continue working during a panic() which is not invoking KDB. This is useful when debugging graphics drivers using the LinuxKPI. Obtained from: kmacy @ MFC after: 1 week
# 298649	26-Apr-2016	pfg	sys: extend use of the howmany() macro when available. We have a howmany() macro in the <sys/param.h> header that is convenient to re-use as it makes things easier to read.
# 297466	31-Mar-2016	jhb	Rework handling of thread sleeps before timers are working. Previously, calls to sleep() and cv_wait*() immediately returned during early boot. Instead, permit threads that request a sleep without a timeout to sleep as wakeup() works during early boot. Sleeps with timeouts are harder to emulate without working timers, so just punt and panic explicitly if any thread tries to use those before timers are working. Any threads that depend on timeouts should either wait until SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers are available. Until APs are started earlier this should be a no-op as other kthreads shouldn't get a chance to start running until after timers are working regardless of when they were created. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D5724
# 297273	25-Mar-2016	cem	Add td_swinvoltick to track last involuntary context switch Expose in DDB via "show thread." Reviewed by: markj Sponsored by: EMC / Isilon Storage Division
# 283735	29-May-2015	kib	Remove several write-only variables, all reported by the gcc 4.9 buildkernel run. Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access. Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 283291	22-May-2015	jkim	CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten years for head. However, it is continuously misused as the mpsafe argument for callout_init(9). Deprecate the flag and clean up callout_init() calls to make them more consistent. Differential Revision: https://reviews.freebsd.org/D2613 Reviewed by: jhb MFC after: 2 weeks
# 282274	30-Apr-2015	jhb	Remove support for Xen PV domU kernels. Support for HVM domU kernels remains. Xen is planning to phase out support for PV upstream since it is harder to maintain and has more overhead. Modern x86 CPUs include virtualization extensions that support HVM guests instead of PV guests. In addition, the PV code was i386 only and not as well maintained recently as the HVM code. - Remove the i386-only NATIVE option that was used to disable certain components for PV kernels. These components are now standard as they are on amd64. - Remove !XENHVM bits from PV drivers. - Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3, etc.) - Remove duplicate copy of <xen/features.h>. - Remove unused, i386-only xenstored.h. Differential Revision: https://reviews.freebsd.org/D2362 Reviewed by: royger Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0) Relnotes: yes
# 281702	18-Apr-2015	markj	Remove unimplemented sched provider probes. They were added for compatibility with the sched provider in Solaris and illumos, but our sched provider is already incompatible since it uses native types, so there isn't much point in keeping them around. Differential Revision: https://reviews.freebsd.org/D2167 Reviewed by: rpaulo
# 277528	22-Jan-2015	hselasky	Revert for r277213: FreeBSD developers need more time to review patches in the surrounding areas like the TCP stack which are using MPSAFE callouts to restore distribution of callouts on multiple CPUs. Bump the __FreeBSD_version instead of reverting it. Suggested by: kmacy, adrian, glebius and kib Differential Revision: https://reviews.freebsd.org/D1438
# 277213	15-Jan-2015	hselasky	Major callout subsystem cleanup and rewrite: - Close a migration race where callout_reset() failed to set the CALLOUT_ACTIVE flag. - Callout callback functions are now allowed to be protected by spinlocks. - Switching the callout CPU number cannot always be done on a per-callout basis. See the updated timeout(9) manual page for more information. - The timeout(9) manual page has been updated to reflect how all the functions inside the callout API are working. The manual page has been made function oriented to make it easier to deduce how each of the functions making up the callout API are working without having to first read the whole manual page. Group all functions into a handful of sections which should give a quick top-level overview when the different functions should be used. - The CALLOUT_SHAREDLOCK flag and its functionality has been removed to reduce the complexity in the callout code and to avoid problems about atomically stopping callouts via callout_stop(). If someone needs it, it can be re-added. From my quick grep there are no CALLOUT_SHAREDLOCK clients in the kernel. - A new callout API function named "callout_drain_async()" has been added. See the updated timeout(9) manual page for a complete description. - Update the callout clients in the "kern/" folder to use the callout API properly, like cv_timedwait(). Previously there was some custom sleepqueue code in the callout subsystem, which has been removed, because we now allow callouts to be protected by spinlocks. This allows us to tear down the callout like done with regular mutexes, and a "td_slpmutex" has been added to "struct thread" to atomically teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and "SWT_SLEEPQTIMO" states can now be completely removed. Currently they are marked as available and will be cleaned up in a follow up commit. - Bump the __FreeBSD_version to indicate kernel modules need recompilation. - There has been several reports that this patch "seems to squash a serious bug leading to a callout timeout and panic". Kernel build testing: all architectures were built MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D1438 Sponsored by: Mellanox Technologies Reviewed by: jhb, adrian, sbruno and emaste
# 273377	21-Oct-2014	hselasky	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies
# 271253	08-Sep-2014	dumbbell	pause_sbt(): Take the cold path (ie. use DELAY()) if KDB is active This fixes a panic in the i915 driver when one uses debug.kdb.enter=1 under vt(4). PR: 193269 Reported by: emaste@ Submitted by: avg@ MFC after: 3 days
# 258648	26-Nov-2013	avg	use saner calculations in should_yield This is based on feedback from bde. MFC after: 6 days
# 258622	26-Nov-2013	avg	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
# 258541	25-Nov-2013	attilio	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
# 255835	24-Sep-2013	mav	Make load average sampling asynchronous to hardclock ticks. This improves measurement of load caused by time-related events still using hardclock. For example, without this change dummynet, scheduling events each hardclock tick, was always miscounted as load of 1. There is still aliasing with events delayed by the new precision mechanism, but it probably can't be avoided without moving this sampling from using callout to some lower-level code or handling it in some other special way. Reviewed by: davide Approved by: re (marius)
# 255745	20-Sep-2013	davide	Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With current lock classes KPI it was really difficult because there was no way to pass an rmtracker object to the lock/unlock routines. In order to accomplish the task, modify the aforementioned functions so that they can return (or pass as argument) an uinptr_t, which is in the rm case used to hold a pointer to struct rm_priotracker for current thread. As an added bonus, this fixes rm_sleep() in the rm shared case, which right now can communicate priotracker structure between lc_unlock()/lc_lock(). Suggested by: jhb Reviewed by: jhb Approved by: re (delphij)
# 255067	30-Aug-2013	hselasky	Simplify pause_sbt() logic. Don't call DELAY() if remainder is less than or equal to zero.
# 254167	09-Aug-2013	cognet	Don't call sleepinit() from proc0_init(), make it a SYSINIT instead. vmem needs the sleepq locks to be initialized when free'ing kva, so we want it called as early as possible.
# 253077	09-Jul-2013	avg	should_yield: protect from td_swvoltick being uninitialized or too stale The distance between ticks and td_swvoltick should be calculated as an unsigned number. Previously we could end up comparing a negative number with hogticks in which case should_yield() would give incorrect answer. We should probably ensure that td_swvoltick is properly initialized. Sponsored by: HybridCluster MFC after: 5 days
# 252357	28-Jun-2013	davide	Correct the comment above _sleep() function which still mentions 'timo' instead of 'sbintime_t'. Reported by: kan
# 248470	18-Mar-2013	jhb	Partially revert r195702. Deferring stops is now implemented via a set of calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep calls. - Remove the stop_allowed parameters from cursig() and issignal(). issignal() checks TDF_SBDRY directly. - Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
# 248186	12-Mar-2013	mav	Make kern_nanosleep() and pause_sbt() to use per-CPU sleep queues. This removes significant sleep queue lock congestion on multithreaded microbenchmarks, making them scale to multiple CPUs almost linearly.
# 247787	04-Mar-2013	davide	MFcalloutng: Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in the KPI, allowing to specify timeout in 'sbintime_t' rather than ticks. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, markj, Fabian Keil
# 247778	04-Mar-2013	davide	MFcalloutng (r244355): Make loadavg calculation callout direct. There are several reasons for it: - it is very simple and doesn't worth context switch to SWI; - since SWI is no longer used here, we can remove twelve years old hack, excluding this SWI from from the loadavg statistics; - it fixes problem when eventtimer (HPET) shares interrupt with some other device, and that interrupt thread counted as permanent loadavg of 1; now loadavg accounted before that interrupt thread is scheduled. Sponsored by: Google Summer of Code 2012, iXsystems inc. Tested by: flo, marius, ian, Fabian Keil, markj
# 245049	04-Jan-2013	bjk	Fix some minor inaccuracies introduced in r243251. Also correct the comment in kern_synch.c which was the source of the problematic text. Reviewed by: kib (previous version) Approved by: hrs (mentor)
# 235459	14-May-2012	rstone	Implement the DTrace sched provider. This implementation aims to be compatible with the sched provider implemented by Solaris and its open- source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages. Note that for compatibility with scripts originally written for Solaris, serveral probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire. Also, I have added a two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace. Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *. Sponsored by: Sandvine Incorporated MFC after: 2 weeks
# 234494	20-Apr-2012	jhb	Include the associated wait channel message for context switch ktrace records. kdump supports both the old and new messages. Submitted by: Andrey Zonov andrey zonov org MFC after: 1 week
# 228424	11-Dec-2011	avg	panic: add a switch and infrastructure for stopping other CPUs in SMP case Historical behavior of letting other CPUs merily go on is a default for time being. The new behavior can be switched on via kern.stop_scheduler_on_panic tunable and sysctl. Stopping of the CPUs has (at least) the following benefits: - more of the system state at panic time is preserved intact - threads and interrupts do not interfere with dumping of the system state Only one thread runs uninterrupted after panic if stop_scheduler_on_panic is set. That thread might call code that is also used in normal context and that code might use locks to prevent concurrent execution of certain parts. Those locks might be held by the stopped threads and would never be released. To work around this issue, it was decided that instead of explicit checks for panic context, we would rather put those checks inside the locking primitives. This change has substantial portions written and re-written by attilio and kib at various times. Other changes are heavily based on the ideas and patches submitted by jhb and mdf. bde has provided many insights into the details and history of the current code. The new behavior may cause problems for systems that use a USB keyboard for interfacing with system console. This is because of some unusual locking patterns in the ukbd code which have to be used because on one hand ukbd is below syscons, but on the other hand it has to interface with other usb code that uses regular mutexes/Giant for its concurrency protection. Dumping to USB-connected disks may also be affected. PR: amd64/139614 (at least) In cooperation with: attilio, jhb, kib, mdf Discussed with: arch@, bde Tested by: Eugene Grosbein <eugen@grosbein.net>, gnn, Steven Hartland <killing@multiplay.co.uk>, glebius, Andrew Boyer <aboyer@averesystems.com> (various versions of the patch) MFC after: 3 months (or never)
# 228234	03-Dec-2011	hselasky	Make sure the description of pause() is equivalent to its implementation. No code change. Suggested by: Bruce Evans MFC after: 3 days
# 227749	20-Nov-2011	hselasky	Given that the typical usage of pause() is pause("zzz", hz / N), where N can be greater than hz in some cases, simply ignore a timeout value of zero. Suggested by: Bruce Evans MFC after: 1 week
# 227748	20-Nov-2011	hselasky	Minor style change: Simplify the description of pause() and shorten the KASSERT message in pause. Also add a clamp for the timo argument in the non-KASSERT case. Suggested by: Bruce Evans MFC after: 1 week
# 227706	19-Nov-2011	hselasky	Simplify the usb_pause_mtx() function by factoring out the generic parts to the kernel's pause() function. The pause() function can now be used when cold != 0. Also assert that the timeout in system ticks must be positive. Suggested by: Bruce Evans MFC after: 1 week
# 225617	16-Sep-2011	kmacy	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# 222252	24-May-2011	jhb	Simplify a stale assertion. We have not called mi_switch() from a nested critical section during a preemption for several years. MFC after: 1 week
# 221829	13-May-2011	mdf	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
# 218424	07-Feb-2011	mdf	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yeild() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
# 217077	06-Jan-2011	jhb	Only change the priority of timeshare threads to PRI_MAX_TIMESHARE when yield() is called. Specifically, leave the priority of real time and idle threads unchanged. MFC after: 2 weeks
# 195702	14-Jul-2009	kib	Add new msleep(9) flag PBDY that shall be specified together with PCATCH, to indicate that thread shall not be stopped upon receipt of SIGSTOP until it reaches the kernel->usermode boundary. Also change thread_single(SINGLE_NO_EXIT) to only stop threads at the user boundary unconditionally. Tested by: pho Reviewed by: jhb Approved by: re (kensmith)
# 195700	14-Jul-2009	kib	When wakeup(9) is going to notify swapper, assert that wait channel is not equal to &proc0. It shall be not, since proc0 stack is not swappable, and kick_proc0() is wakeup(&proc0). Reviewed by: jhb Approved by: re (kensmith)
# 189074	26-Feb-2009	ed	Remove even more unneeded variable assignments. kern_time.c: - Unused variable `p'. kern_thr.c: - Variable `error' is always caught immediately, so no reason to initialize it. There is no way that error != 0 at the end of create_thread(). kern_sig.c: - Unused variable `code'. kern_synch.c: - `rval' is always assigned in all different cases. kern_rwlock.c: - `v' is always overwritten with RW_UNLOCKED further on. kern_malloc.c: - `size' is always initialized with the proper value before being used. kern_exit.c: - `error' is always caught and returned immediately. abort2() never returns a non-zero value. kern_exec.c: - `len' is always assigned inside the if-statement right below it. tty_info.c: - `td' is always overwritten by FOREACH_THREAD_IN_PROC(). Found by: LLVM's scan-build
# 187357	17-Jan-2009	jeff	- Implement generic macros for producing KTR records that are compatible with src/tools/sched/schedgraph.py. This allows developers to quickly create a graphical view of ktr data for any resource in the system. - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly identifying records associated with a thread or cpu. - Reimplement the KTR_SCHED traces using the new generic facility. Obtained from: attilio Discussed with: jhb Sponsored by: Nokia
# 184042	18-Oct-2008	kmacy	- Forward port flush of page table updates on context switch or userret - Forward port vfork XEN hack
# 183352	25-Sep-2008	jhb	- Don't do a WITNESS_SAVE() on the interlock if it is Giant in the condition variable wait routines. DROP_GIANT() already manages that state in the Giant interlock case. - Assert that Giant is held when it is passed as a sleep interlock.
# 181921	20-Aug-2008	ed	Remove the now unused `lbolt' variable from the kernel. We used to have a single wait channel inside the kernel which could be used by threads that just wanted to sleep for some time (the next second). The old TTY layer was the only piece of code that still used lbolt, because I already removed the use of lbolt from the NFS clients and the VFS syncer. Approved by: philip
# 181394	07-Aug-2008	jhb	Permit Giant to be passed as the explicit interlock either to msleep/mtx_sleep or the various cv_wait() routines. Currently, the "unlock" behavior of PDROP and cv_wait_unlock() with Giant is not permitted as it is will be confusing since Giant is fully unrecursed and unlocked during a thread sleep. This is handy for subsystems which wish to allow unlocked drivers to continue to use Giant such as CAM, the new TTY layer, and the new USB stack. CAM currently uses a hack that I told Scott to use because I really didn't want to permit this behavior, and the TTY and USB patches both have various patches to permit this. MFC after: 2 weeks
# 181334	05-Aug-2008	jhb	If a thread that is swapped out is made runnable, then the setrunnable() routine wakes up proc0 so that proc0 can swap the thread back in. Historically, this has been done by waking up proc0 directly from setrunnable() itself via a wakeup(). When waking up a sleeping thread that was swapped out (the usual case when waking proc0 since only sleeping threads are eligible to be swapped out), this resulted in a bit of recursion (e.g. wakeup() -> setrunnable() -> wakeup()). With sleep queues having separate locks in 6.x and later, this caused a spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock). An attempt was made to fix this in 7.0 by making the proc0 wakeup use the ithread mechanism for doing the wakeup. However, this required grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated into the same LOR since the thread lock would be some other sleepq lock. Fix this by deferring the wakeup of the swapper until after the sleepq lock held by the upper layer has been locked. The setrunnable() routine now returns a boolean value to indicate whether or not proc0 needs to be woken up. The end result is that consumers of the sleepq API such as *sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup proc0 if they get a non-zero return value from sleepq_abort(), sleepq_broadcast(), or sleepq_signal(). Discussed with: jeff Glanced at by: sam Tested by: Jurgen Weber jurgen - ish com au MFC after: 2 weeks
# 178272	17-Apr-2008	jeff	- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes. - In reset walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly. - Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way. - Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption ithreads idling, etc. - Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc. Sponsored by: Nokia
# 177272	16-Mar-2008	rwatson	Consistently use ANSI C declarationsfor all functions in kern_synch.c.
# 177253	16-Mar-2008	rwatson	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 177091	12-Mar-2008	jeff	Remove kernel support for M:N threading. While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.
# 177085	12-Mar-2008	jeff	- Pass the priority argument from sleep() into sleepq and down into sched_sleep(). This removes extra thread_lock() acquisition and allows the scheduler to decide what to do with the static boost. - Change the priority arguments to cv_ to match sleepq/msleep/etc. where 0 means no priority change. Catch -1 in cv_broadcastpri() and convert it to 0 for now. - Set a flag when sleeping in a way that is compatible with swapping since direct priority comparisons are meaningless now. - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which controls the boost behavior. Turning it off gives better performance in some workloads but needs more investigation. - While we're modifying sleepq, change signal and broadcast to both return with the lock held as the lock was held on enter. Reviewed by: jhb, peter
# 177010	10-Mar-2008	jeff	- Handle kdb switch panics outside of mi_switch() to remove some instructions from the common path and make the code more clear. Whether this has any impact on performance may depend on optimization levels. Sponsored by: Nokia
# 175219	10-Jan-2008	rwatson	Don't zero td_runtime when billing thread CPU usage to the process; maintain a separate td_incruntime to hold unbilled CPU usage for the thread that has the previous properties of td_runtime. When thread information is requested using the thread monitoring sysctls, export thread td_runtime instead of process rusage runtime in kinfo_proc. This restores the display of individual ithread and other kernel thread CPU usage since inception in ps -H and top -SH, as well for libthr user threads, valuable debugging information lost with the move to try kthreads since they are no longer independent processes. There is universal agreement that we should rewrite the process and thread export sysctls, but this commit gets things going a bit better in the mean time. Likewise, there are resevations about the continued validity of statclock given the speed of modern processors. Reviewed by: attilio, emaste, jhb, julian
# 173601	14-Nov-2007	julian	A bunch more files that should probably print out a thread name instead of a process name.
# 173600	14-Nov-2007	julian	generally we are interested in what thread did something as opposed to what process. Since threads by default have teh name of the process unless over-written with more useful information, just print the thread name instead.
# 172482	08-Oct-2007	jeff	- Restore historical yield() behavior by manually lowering priority and switching. Approved by: re
# 172207	17-Sep-2007	jeff	- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in source tools to use the new and more correct location of P_INMEM. Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)
# 170294	04-Jun-2007	jeff	Commit 2/14 of sched_lock decomposition. - Adapt sleepqueues to the new thread_lock() mechanism. - Delay assigning the sleep queue spinlock as the thread lock until after we've checked for signals. It is illegal for a thread to return in mi_switch() with any lock assigned to td_lock other than the scheduler locks. - Change sleepq_catch_signals() to do the switch if necessary to simplify the callers. - Simplify timeout handling now that locking a sleeping thread has the side-effect of locking the sleepqueue. Some previous races are no longer possible. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
# 170292	04-Jun-2007	attilio	Do proper "locking" for missing vmmeters part. Now, we assume no more sched_lock protection for some of them and use the distribuited loads method for vmmeter (distribuited through CPUs). Reviewed by: alc, bde Approved by: jeff (mentor)
# 170175	31-May-2007	jeff	Forced commit to describe changes in the last revision. - Move cpu limit handling to a callout that runs once per-second and sums up all threads tick times to check for violations. This removes all code from mi_switch() that touches the proc. This also cleans up ast() a bit by removing one large case.
# 170174	31-May-2007	jeff	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
# 170170	31-May-2007	attilio	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# 169667	18-May-2007	jeff	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 169392	08-May-2007	jhb	Fix a potential LOR with sx_sleep() and cv_wait() with sx locks by 1) adding the thread to the sleepq via sleepq_add() before dropping the lock, and 2) dropping the sleepq lock around calls to lc_unlock() for sleepable locks (i.e. locks that use sleepq's in their implementation).
# 167787	21-Mar-2007	jhb	Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes, rwlocks, and sx locks to 'lock_object'.
# 167387	09-Mar-2007	jhb	Allow threads to atomically release rw and sx locks while waiting for an event. Locking primitives that support this (mtx, rw, and sx) now each include their own foo_sleep() routine. - Rename msleep() to _sleep() and change it's 'struct mtx' object to a 'struct lock_object' pointer. _sleep() uses the recently added lc_unlock() and lc_lock() function pointers for the lock class of the specified lock to release the lock while the thread is suspended. - Add wrappers around _sleep() for mutexes (mtx_sleep()), rw locks (rw_sleep()), and sx locks (sx_sleep()). msleep() still exists and is now identical to mtx_sleep(), but it is deprecated. - Rename SLEEPQ_MSLEEP to SLEEPQ_SLEEP. - Rewrite much of sleep.9 to not be msleep(9) centric. - Flesh out the 'RETURN VALUES' section in sleep.9 and add an 'ERRORS' section. - Add __nonnull(1) to _sleep() and msleep_spin() so that the compiler will warn if you try to pass a NULL wait channel. The functions already have a KASSERT to that effect.
# 167327	08-Mar-2007	julian	Instead of doing comparisons using the pcpu area to see if a thread is an idle thread, just see if it has the IDLETD flag set. That flag will probably move to the pflags word as it's permenent and never chenges for the life of the system so it doesn't need locking.
# 167232	05-Mar-2007	rwatson	Further system call comment cleanup: - Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde) - Remove extra blank lines in some cases. - Add extra blank lines in some cases. - Remove no-op comments consisting solely of the function name, the word "syscall", or the system call name. - Add punctuation. - Re-wrap some comments.
# 167089	27-Feb-2007	jhb	Print tid's rather than thread pointers in KTR_PROC traces.
# 166908	23-Feb-2007	jhb	Add a new kernel sleep function pause(9). pause(9) is for places that want an equivalent of DELAY(9) that sleeps instead of spins. It accepts a wmesg and a timeout and is not interrupted by signals. It uses a private wait channel that should never be woken up by wakeup(9) or wakeup_one(9). Glanced at by: phk
# 165741	03-Jan-2007	jeff	- Fix schedgraph output with KSE threads. Call thread_switchout() after calling CTR() so we don't confuse a new kse thread with a real preemption.
# 165272	16-Dec-2006	kmacy	Add second sleep queue so that sx and lockmgr can have separate sleep queues for shared and exclusive acquisitions Submitted by: Attilio Rao Approved by: jhb
# 164766	30-Nov-2006	phk	Only grab the sched_lock if we actually need to modify the thread priority. During a buildworld only 2/3 of the calls to msleep actually changed the priority.
# 164325	15-Nov-2006	pjd	Change sleepq_add(9) argument from 'struct mtx ' to 'struct lock_object ', which allows to use it with different kinds of locks. For example it allows to implement Solaris conditions variables which will be used in ZFS port on top of sx(9) locks. Reviewed by: jhb
# 164307	15-Nov-2006	jhb	Adjust assertions to allow for magical properties of the 'lbolt' wait channel for tsleep(): - Allow tsleep() on &lbolt without Giant with a timeout 0 since &lbolt has an implied timeout. - If &lbolt is used with msleep() pass NULL to sleepq_add() for the lock object. Unlike other sleepq channels, &lbolt doesn't have an associated owning lock.
# 163709	26-Oct-2006	jb	Make KSE a kernel option, turned on by default in all GENERIC kernel configs except sun4v (which doesn't process signals properly with KSE). Reviewed by: davidxu@
# 159631	15-Jun-2006	davidxu	Use scheduler API sched_relinquish() to implement yield() syscall.
# 159205	03-Jun-2006	jhb	In the case of reentering the debugger due to an attempt to perform a context switch while in the debugger, reenter the debugger sooner before performing any statistics updates.
# 157815	17-Apr-2006	jhb	Change msleep() and tsleep() to not alter the calling thread's priority if the specified priority is zero. This avoids a race where the calling thread could read a snapshot of it's current priority, then a different thread could change the first thread's priority, then the original thread would call sched_prio() inside msleep() undoing the change made by the second thread. I used a priority of zero as no thread that calls msleep() or tsleep() should be specifying a priority of zero anyway. The various places that passed 'curthread->td_priority' or some variant as the priority now pass 0.
# 155932	22-Feb-2006	davidxu	Fix a sleep queue race for KSE thread. Reviewed by: jhb
# 155924	22-Feb-2006	jhb	Fixup some comments. Mutexes's are locked, not entered for several years now and msleep blocks threads rather than processes.
# 155741	15-Feb-2006	davidxu	Fix a long standing race between sleep queue and thread suspension code. When a thread A is going to sleep, it calls sleepq_catch_signals() to detect any pending signals or thread suspension request, if nothing happens, it returns without holding process lock or scheduler lock, this opens a race window which allows thread B to come in and do process suspension work, however since A is still at running state, thread B can do nothing to A, thread A continues, and puts itself into actually sleeping state, but B has never seen it, and it sits there forever until B is woken up by other threads sometimes later(this can be very long delay or never happen). Fix this bug by forcing sleepq_catch_signals to return with scheduler lock held. Fix sleepq_abort() by passing it an interrupted code, previously, it worked as wakeup_one(), and the interruption can not be identified correctly by sleep queue code when the sleeping thread is resumed. Let thread_suspend_check() returns EINTR or ERESTART, so sleep queue no longer has to use SIGSTOP as a hack to build a return value. Reviewed by: jhb MFC after: 1 week
# 155534	11-Feb-2006	phk	CPU time accounting speedup (step 2) Keep accounting time (in per-cpu) cputicks and the statistics counts in the thread and summarize into struct proc when at context switch. Don't reach across CPUs in calcru(). Add code to calibrate the top speed of cpu_tickrate() for variable cpu_tick hardware (like TSC on power managed machines). Don't enforce monotonicity (at least for now) in calcru. While the calibrated cpu_tickrate ramps up it may not be true. Use 27MHz counter on i386/Geode. Use TSC on amd64 & i386 if present. Use tick counter on sparc64
# 155444	07-Feb-2006	phk	Modify the way we account for CPU time spent (step 1) Keep track of time spent by the cpu in various contexts in units of "cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t only when somebody wants to inspect the numbers. For now "cputicks" are still derived from the current timecounter and therefore things should by definition remain sensible also on SMP machines. (The main reason for this first milestone commit is to verify that hypothesis.) On slower machines, the avoided multiplications to normalize timestams at every context switch, comes out as a 5-7% better score on the unixbench/context1 microbenchmark. On more modern hardware no change in performance is seen.
# 153856	29-Dec-2005	jhb	patch(1) and I aren't friends today. Axe a duplicate copy of the msleep_spin() function definition. Spotted by: pjd
# 153855	29-Dec-2005	jhb	Add a new function msleep_spin() which is a slightly stripped down version of msleep(). msleep_spin() doesn't support changing the priority of the thread while it is asleep nor does it support interruptible sleeps (PCATCH) or the PDROP flag. It does support timeouts however. It differs from msleep() in that the passed in mutex is a spin mutex. This means one can use msleep_spin() and wakeup() with a spin mutex similar to msleep() and wakeup() with a regular mutex. Note that the spin mutex in question needs to come before sched_lock and the sleepq locks in lock order.
# 152899	28-Nov-2005	jhb	When checking to see if a process has exceeded its time limit, flag the process as over the limit when its time is >= to the limit rather than > the limit. Technically, if p->p_rux.rux_runtime.sec == p->p_pcpulimit and p->p_rux.rux_runtime.frac == 0, the process hasn't exceeded the limit yet. However, having the fraction exactly equal to 0 is rather rare, and it is not worth the overhead to handle that edge case. With just the > comparison, the process would have to exceed its limit by almost a second before it was killed. PR: kern/83192 Submitted by: Maciej Zawadzinski mzawadzinski at gmail dot com Reviewed by: bde MFC after: 1 week
# 146554	23-May-2005	ups	Use low level constructs borrowed from interrupt threads to wait for work in proc0. Remove the TDP_WAKEPROC0 workaround.
# 144777	08-Apr-2005	ups	Sprinkle some volatile magic and rearrange things a bit to avoid race conditions in critical_exit now that it no longer blocks interrupts. Reviewed by: jhb
# 144425	31-Mar-2005	jhb	Don't recursively panic when we call mi_switch() in a critical section, even though calling mi_switch() after a panic is likely a bug anyway as the recursive panic only serves to make things worse.
# 139451	30-Dec-2004	jhb	Stop explicitly touching td_base_pri outside of the scheduler and simply set a thread's priority via sched_prio() when that is the desired action. The schedulers will start managing td_base_pri internally shortly.
# 139315	25-Dec-2004	jeff	- Define KTR points for KTR_SCHED.
# 138131	27-Nov-2004	davidxu	Unlock mutex if PDROP was set by caller.
# 136583	16-Oct-2004	scottl	If a process needs to be swapped in, wakeup the swapper from within critical_exit as the process is getting scheduled to run. This is subotimal but for now avoid the LOR between the scheduler and the sleepq systems. This is a 5.3 candidate. Submitted by: davidxu MFC After: 3 days
# 136445	12-Oct-2004	jhb	Refine the turnstile and sleep queue interfaces just a bit: - Add a new _lock() call to each API that locks the associated chain lock for a lock_object pointer or wait channel. The _lookup() functions now require that the chain lock be locked via _lock() when they are called. - Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup the associated queue structure internally via _lookup() rather than accepting a pointer from the caller. For turnstiles, this means that the actual lookup of the turnstile in the hash table is only done when the thread actually blocks rather than being done on each loop iteration in _mtx_lock_sleep(). For sleep queues, this means that sleepq_lookup() is no longer used outside of the sleep queue code except to implement an assertion in cv_destroy(). - Change sleepq_broadcast() and sleepq_signal() to require that the chain lock is already required. For condition variables, this lets the cv_broadcast() and cv_signal() functions lock the sleep queue chain lock while testing the waiters count. This means that the waiters count internal to condition variables is no longer protected by the interlock mutex and cv_broadcast() and cv_signal() now no longer require that the interlock be held when they are called. This lets consumers of condition variables drop the lock before waking other threads which can result in fewer context switches. MFC after: 1 month
# 136152	05-Oct-2004	jhb	Rework how we store process times in the kernel such that we always store the raw values including for child process statistics and only compute the system and user timevals on demand. - Fix the various kern_wait() syscall wrappers to only pass in a rusage pointer if they are going to use the result. - Add a kern_getrusage() function for the ABI syscalls to use so that they don't have to play stackgap games to call getrusage(). - Fix the svr4_sys_times() syscall to just call calcru() to calculate the times it needs rather than calling getrusage() twice with associated stackgap, etc. - Add a new rusage_ext structure to store raw time stats such as tick counts for user, system, and interrupt time as well as a bintime of the total runtime. A new p_rux field in struct proc replaces the same inline fields from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux field in struct proc contains the "raw" child time usage statistics. ruadd() has been changed to handle adding the associated rusage_ext structures as well as the values in rusage. Effectively, the values in rusage_ext replace the ru_utime and ru_stime values in struct rusage. These two fields in struct rusage are no longer used in the kernel. - calcru() has been split into a static worker function calcru1() that calculates appropriate timevals for user and system time as well as updating the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a copy of the process' p_rux structure to compute the timevals after updating the runtime appropriately if any of the threads in that process are currently executing. It also now only locks sched_lock internally while doing the rux_runtime fixup. calcru() now only requires the caller to hold the proc lock and calcru1() only requires the proc lock internally. calcru() also no longer allows callers to ask for an interrupt timeval since none of them actually did. - calcru() now correctly handles threads executing on other CPUs. - A new calccru() function computes the child system and user timevals by calling calcru1() on p_crux. Note that this means that any code that wants child times must now call this function rather than reading from p_cru directly. This function also requires the proc lock. - This finishes the locking for rusage and friends so some of the Giant locks in exit1() and kern_wait() are now gone. - The locking in ttyinfo() has been tweaked so that a shared lock of the proctree lock is used to protect the process group rather than the process group lock. By holding this lock until the end of the function we now ensure that the process/thread that we pick to dump info about will no longer vanish while we are trying to output its info to the console. Submitted by: bde (mostly) MFC after: 1 month
# 135295	16-Sep-2004	julian	clean up thread runq accounting a bit. MFC after: 3 days
# 135051	10-Sep-2004	julian	Add some code to allow threads to nominat a sibling to run if theyu are going to sleep. MFC after: 1 week
# 134791	05-Sep-2004	julian	Refactor a bunch of scheduler code to give basically the same behaviour but with slightly cleaned up interfaces. The KSE structure has become the same as the "per thread scheduler private data" structure. In order to not make the diffs too great one is #defined as the other at this time. The KSE (or td_sched) structure is now allocated per thread and has no allocation code of its own. Concurrency for a KSEGRP is now kept track of via a simple pair of counters rather than using KSE structures as tokens. Since the KSE structure is different in each scheduler, kern_switch.c is now included at the end of each scheduler. Nothing outside the scheduler knows the contents of the KSE (aka td_sched) structure. The fields in the ksegrp structure that are to do with the scheduler's queueing mechanisms are now moved to the kg_sched structure. (per ksegrp scheduler private data structure). In other words how the scheduler queues and keeps track of threads is no-one's business except the scheduler's. This should allow people to write experimental schedulers with completely different internal structuring. A scheduler call sched_set_concurrency(kg, N) has been added that notifies teh scheduler that no more than N threads from that ksegrp should be allowed to be on concurrently scheduled. This is also used to enforce 'fainess' at this time so that a ksegrp with 10000 threads can not swamp a the run queue and force out a process with 1 thread, since the current code will not set the concurrency above NCPU, and both schedulers will not allow more than that many onto the system run queue at a time. Each scheduler should eventualy develop their own methods to do this now that they are effectively separated. Rejig libthr's kernel interface to follow the same code paths as linkse for scope system threads. This has slightly hurt libthr's performance but I will work to recover as much of it as I can. Thread exit code has been cleaned up greatly. exit and exec code now transitions a process back to 'standard non-threaded mode' before taking the next step. Reviewed by: scottl, peter MFC after: 1 week
# 134013	19-Aug-2004	jhb	Now that the return value semantics of cv's for multithreaded processes have been unified with that of msleep(9), further refine the sleepq interface and consolidate some duplicated code: - Move the pre-sleep checks for theaded processes into a thread_sleep_check() function in kern_thread.c. - Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c. Specifically, if a thread is awakened by something other than a signal while checking for signals before going to sleep, clear TDF_SINTR in sleepq_catch_signals(). This removes a sched_lock lock/unlock combo in that edge case during an interruptible sleep. Also, fix sleepq_check_signals() to properly handle the condition if TDF_SINTR is clear rather than requiring the callers of the sleepq API to notice this edge case and call a non-_sig variant of sleepq_wait(). - Clarify the flags arguments to sleepq_add(), sleepq_signal() and sleepq_broadcast() by creating an explicit submask for sleepq types. Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of 0. Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and move the setting of TDF_SINTR to sleepq_add() if this flag is set rather than sleepq_catch_signals(). Note that it is the caller's responsibility to ensure that sleepq_catch_signals() is called if and only if this flag is passed to the preceeding sleepq_add(). Note that this also removes a sched_lock lock/unlock pair from sleepq_catch_signals(). It also ensures that for an interruptible sleep, TDF_SINTR is always set when TD_ON_SLEEPQ() is true.
# 133396	09-Aug-2004	julian	Increase the amount of data exported by KTR in the KTR_RUNQ setting. This extra data is needed to really follow what is going on in the threaded case.
# 133138	04-Aug-2004	jhb	Workaround a possible deadlock on SMP due to a spin lock LOR by disabling the immediate awakening of proc0 (scheduler kproc, controls swapping processes in and out). The scheduler process periodically awakens already, so this will not result in processes not being swapped in, there will just be more latency in between a thread being made runnable and the scheduler waking up to swap the affected process back in.
# 132773	28-Jul-2004	davidxu	Use P_SINGLE_EXIT to check single-threading case, P_WEXIT is not for that purpose.
# 132266	16-Jul-2004	jhb	- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags since they are only accessed by curthread and thus do not need any locking. - Move pr_addr and pr_ticks out of struct uprof (which is per-process) and directly into struct thread as td_profil_addr and td_profil_ticks as these variables are really per-thread. (They are used to defer an addupc_intr() that was too "hard" until ast()).
# 131927	10-Jul-2004	marcel	Update for the KDB framework: o Make debugging code conditional upon KDB instead of DDB. o Call kdb_enter() instead of Debugger(). o Call kdb_backtrace() instead of db_print_backtrace() or backtrace(). kern_mutex.c: o Replace checks for db_active with checks for kdb_active and make them unconditional. kern_shutdown.c: o s/DDB_UNATTENDED/KDB_UNATTENDED/g o s/DDB_TRACE/KDB_TRACE/g o Save the TID of the thread doing the kernel dump so the debugger knows which thread to select as the current when debugging the kernel core file. o Clear kdb_active instead of db_active and do so unconditionally. o Remove backtrace() implementation. kern_synch.c: o Call kdb_reenter() instead of db_error().
# 131481	02-Jul-2004	jhb	Implement preemption of kernel threads natively in the scheduler rather than as one-off hacks in various other parts of the kernel: - Add a function maybe_preempt() that is called from sched_add() to determine if a thread about to be added to a run queue should be preempted to directly. If it is not safe to preempt or if the new thread does not have a high enough priority, then the function returns false and sched_add() adds the thread to the run queue. If the thread should be preempted to but the current thread is in a nested critical section, then the flag TDF_OWEPREEMPT is set and the thread is added to the run queue. Otherwise, mi_switch() is called immediately and the thread is never added to the run queue since it is switch to directly. When exiting an outermost critical section, if TDF_OWEPREEMPT is set, then clear it and call mi_switch() to perform the deferred preemption. - Remove explicit preemption from ithread_schedule() as calling setrunqueue() now does all the correct work. This also removes the do_switch argument from ithread_schedule(). - Do not use the manual preemption code in mtx_unlock if the architecture supports native preemption. - Don't call mi_switch() in a loop during shutdown to give ithreads a chance to run if the architecture supports native preemption since the ithreads will just preempt DELAY(). - Don't call mi_switch() from the page zeroing idle thread for architectures that support native preemption as it is unnecessary. - Native preemption is enabled on the same archs that supported ithread preemption, namely alpha, i386, and amd64. This change should largely be a NOP for the default case as committed except that we will do fewer context switches in a few cases and will avoid the run queues completely when preempting. Approved by: scottl (with his re@ hat)
# 131473	02-Jul-2004	jhb	- Change mi_switch() and sched_switch() to accept an optional thread to switch to. If a non-NULL thread pointer is passed in, then the CPU will switch to that thread directly rather than calling choosethread() to pick a thread to choose to. - Make sched_switch() aware of idle threads and know to do TD_SET_CAN_RUN() instead of sticking them on the run queue rather than requiring all callers of mi_switch() to know to do this if they can be called from an idlethread. - Move constants for arguments to mi_switch() and thread_single() out of the middle of the function prototypes and up above into their own section.
# 131249	28-Jun-2004	jhb	Remove the signal_caught argument from sleepq_timedwait() as it was effectively always zero.
# 130182	07-Jun-2004	tjr	Remove a stale and misleading comment.
# 129241	14-May-2004	bde	Fixed some common printf format errors. Don't assume that "struct foo " is "void " (it isn't) or that the default promotion of pid_t is int. Instead, assume that casting "struct foo " to "void " and printing the result with %p is useful, and that all pid_t's are representable as longs. Fixed some minor style bugs (mainly spelling errors in comments).
# 127911	05-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 126885	12-Mar-2004	jhb	- Remove old sleep queues. - Remove sleepqueue argument from sleepq_set_timeout() since it is not used.
# 126755	08-Mar-2004	rwatson	Mark loadaverage callout as CALLOUT_MPSAFE. Reviewed by: jhb
# 126487	02-Mar-2004	jhb	Correct handling of PDROP in msleep() to just skip the mtx_lock() rather than clear the lock pointer so that sleepq_add() still gets the correct lock pointer and doesn't bogusly trip an assertion.
# 126326	27-Feb-2004	jhb	Switch the sleep/wakeup and condition variable implementations to use the sleep queue interface: - Sleep queues attempt to merge some of the benefits of both sleep queues and condition variables. Having sleep qeueus in a hash table avoids having to allocate a queue head for each wait channel. Thus, struct cv has shrunk down to just a single char * pointer now. However, the hash table does not hold threads directly, but queue heads. This means that once you have located a queue in the hash bucket, you no longer have to walk the rest of the hash chain looking for threads. Instead, you have a list of all the threads sleeping on that wait channel. - Outside of the sleepq code and the sleep/cv code the kernel no longer differentiates between cv's and sleep/wakeup. For example, calls to abortsleep() and cv_abort() are replaced with a call to sleepq_abort(). Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and cv_waitq_remove() have been replaced with calls to sleepq_remove(). - The sched_sleep() function no longer accepts a priority argument as sleep's no longer inherently bump the priority. Instead, this is soley a propery of msleep() which explicitly calls sched_prio() before blocking. - The TDF_ONSLEEPQ flag has been dropped as it was never used. The associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been dropped and replaced with a single explicit clearing of td_wchan. TD_SET_ONSLEEPQ() would really have only made sense if it had taken the wait channel and message as arguments anyway. Now that that only happens in one place, a macro would be overkill.
# 125292	01-Feb-2004	jeff	- Revert rev 1.240 we no longer need a kthread for loadav().
# 125290	01-Feb-2004	jeff	- Use sched_load() rather than grabbing the sx lock and traversing the proc table to discover the load.
# 125162	28-Jan-2004	jhb	Move the loadav() callout into its own kthread since it uses allproc_lock which is a sleepable lock and thus is not safe to acquire from a callout routine.
# 124954	25-Jan-2004	jeff	- Use a unique string for the sched_setup SYSINIT and rename sched_setup to synch_setup. The schedulers use the sched_setup function name.
# 124944	25-Jan-2004	jeff	- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or SW_INVOL. Assert that one of these is set in mi_switch() and propery adjust the rusage statistics. This is to simplify the large number of users of this interface which were previously all required to adjust the proper counter prior to calling mi_switch(). This also facilitates more switch and locking optimizations. - Change all callers of mi_switch() to pass the appropriate paramter and remove direct references to the process statistics.
# 121688	29-Oct-2003	bde	Removed mostly-dead code for setting switchtime after the idle loop clobbers this variable. Long ago, when the idle loop wasn't in a process, it set switchtime.tv_sec to zero to indicate that the time needs to be read after the idle loop finishes. The special case for this isn't needed now that there is an idle process (for each CPU). The time is read in the normal way when the idle process is switched away from. The seconds component of the time is only zero for the first second after the uptime is set, and the mostly-dead code was only executed during this time. (This was slightly broken by using uptimes instead of times relative to the Epoch -- in the original version the seconds component of the time was only 0 for the first second after the Epoch.) In mi_switch(), moved the setting of switchticks to just after the first (and now only) setting of switchtime. This setting used to be delayed since a late setting was needed for the idle case and an early setting was not needed. Now the early setting is needed so that fork_exit() doesn't need to set either switchtime or switchticks. Removed now-completely-rotted comment attached to this. Most of the code described by the comment had already moved to sched_switch().
# 121128	16-Oct-2003	jeff	- Collapse sched_switchin() and sched_switchout() into sched_switch(). Now mi_switch() calls sched_switch() which calls cpu_switch(). This is actually one less function call than it had been.
# 120802	05-Oct-2003	bms	Add a pre-emption counter, td_generation, so that threads can notice when they have been pre-empted by other threads. This is bumped from within mi_switch() every time a context switch takes place. Discussed with: pete
# 120602	30-Sep-2003	jeff	- On my Pentium4-M laptop, invalpg takes ~1100 cycles if the page is found in the TLB and ~1600 if it is not. Therefore, it is more effecient to invalidate the TLB after operations that use CMAP rather than before. - So that the tlb is invalidated prior to switching off of a processor, we must change the switchin functions to switchout functions. - Remove td_switchout from the thread and move it to the x86 pcb. - Move the code that calls switchout into swtch.s. These changes make this optimization truely x86 specific.
# 119137	19-Aug-2003	sam	Change instances of callout_init that specify MPSAFE behaviour to use CALLOUT_MPSAFE instead of "1" for the second parameter. This does not change the behaviour; it just makes the intent more clear.
# 118972	15-Aug-2003	jhb	- Various style fixes in both code and comments. - Update some stale comments. - Sort a couple of includes. - Only set 'newcpu' in updatepri() if we use it. - No functional changes. Obtained from: bde (via an old diff I got a long time ago)
# 118893	14-Aug-2003	grehan	Update powerpc to use the (old thread,new thread) calling convention for cpu_throw() and cpu_switch().
# 118835	12-Aug-2003	jhb	- Convert Alpha over to the new calling conventions for cpu_throw() and cpu_switch() where both the old and new threads are passed in as arguments. Only powerpc uses the old conventions now. - Update comments in the Alpha swtch.s to reflect KSE changes. Tested by: obrien, marcel
# 117372	09-Jul-2003	peter	unifdef -DLAZY_SWITCH and start to tidy up the associated glue.
# 116963	28-Jun-2003	davidxu	o Change kse_thr_interrupt to allow send a signal to a specified thread, or unblock a thread in kernel, and allow UTS to specify whether syscall should be restarted. o Add ability for UTS to monitor signal comes in and removed from process, the flag PS_SIGEVENT is used to indicate the events. o Add a KMF_WAITSIGEVENT for KSE mailbox flag, UTS call kse_release with this flag set to wait for above signal event. o For SA based thread, kernel masks all signal in its signal mask, let UTS to use kse_thr_interrupt interrupt a thread, and install a signal frame in userland for the thread. o Add a tm_syncsig in thread mailbox, when a hardware trap occurs, it is used to deliver synchronous signal to userland, and upcall is schedule, so UTS can process the synchronous signal for the thread. Reviewed by: julian (mentor)
# 116930	27-Jun-2003	peter	Tidy up leftover lazy_switch instrumentation that is no longer needed. This cleans up some #ifdef hell.
# 116361	14-Jun-2003	davidxu	Rename P_THREADED to P_SA. P_SA means a process is using scheduler activations.
# 116182	10-Jun-2003	obrien	Use __FBSDID().
# 115539	31-May-2003	phk	Add a couple of XXX comments where the intent is not clear. Found by: FlexeLint
# 115084	16-May-2003	marcel	Revamp of the syscall path, exception and context handling. The prime objectives are: o Implement a syscall path based on the epc inststruction (see sys/ia64/ia64/syscall.s). o Revisit the places were we need to save and restore registers and define those contexts in terms of the register sets (see sys/ia64/include/_regset.h). Secundairy objectives: o Remove the requirement to use contigmalloc for kernel stacks. o Better handling of the high FP registers for SMP systems. o Switch to the new cpu_switch() and cpu_throw() semantics. o Add a good unwinder to reconstruct contexts for the rare cases we need to (see sys/contrib/ia64/libuwx) Many files are affected by this change. Functionally it boils down to: o The EPC syscall doesn't preserve registers it does not need to preserve and places the arguments differently on the stack. This affects libc and truss. o The address of the kernel page directory (kptdir) had to be unstaticized for use by the nested TLB fault handler. The name has been changed to ia64_kptdir to avoid conflicts. The renaming affects libkvm. o The trapframe only contains the special registers and the scratch registers. For syscalls using the EPC syscall path no scratch registers are saved. This affects all places where the trapframe is accessed. Most notably the unaligned access handler, the signal delivery code and the debugger. o Context switching only partly saves the special registers and the preserved registers. This affects cpu_switch() and triggered the move to the new semantics, which additionally affects cpu_throw(). o The high FP registers are either in the PCB or on some CPU. context switching for them is done lazily. This affects trap(). o The mcontext has room for all registers, but not all of them have to be defined in all cases. This mostly affects signal delivery code now. The *context syscalls are as of yet still unimplemented. Many details went into the removal of the requirement to use contigmalloc for kernel stacks. The details are mostly CPU specific and limited to exception_save() and exception_restore(). The few places where we create, destroy or switch stacks were mostly simplified by not having to construct physical addresses and additionally saving the virtual addresses for later use. Besides more efficient context saving and restoring, which of course yields a noticable speedup, this also fixes the dreaded SMP bootup problem as a side-effect. The details of which are still not fully understood. This change includes all the necessary backward compatibility code to have it handle older userland binaries that use the break instruction for syscalls. Support for break-based syscalls has been pessimized in favor of a clean implementation. Due to the overall better performance of the kernel, this will still be notived as an improvement if it's noticed at all. Approved by: re@ (jhb)
# 114983	13-May-2003	jhb	- Merge struct procsig with struct sigacts. - Move struct sigacts out of the u-area and malloc() it using the M_SUBPROC malloc bucket. - Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(), sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared(). - Remove the p_sigignore, p_sigacts, and p_sigcatch macros. - Add a mutex to struct sigacts that protects all the members of the struct. - Add sigacts locking. - Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now that sigacts is locked. - Several in-kernel functions such as psignal(), tdsignal(), trapsignal(), and thread_stopped() are now MP safe. Reviewed by: arch@ Approved by: re (rwatson)
# 114750	05-May-2003	jhb	Remove TD_ON_RUNQ() from a check to make sure Giant is not held when calling mi_switch(). The kernel would panic on an earlier KASSERT() in mi_switch() if TD_ON_RUNQ() was true.
# 114435	01-May-2003	jhb	Garbage collect unused TDF_INMSLEEP flag.
# 114336	30-Apr-2003	peter	AMD64 uses the new-style cpu_switch()/cpu_throw() calling conventions.
# 113920	23-Apr-2003	jhb	- Protect p_numthreads with the sched_lock. - Protect p_singlethread with both the sched_lock and the proc lock. - Protect p_suspcount with the proc lock.
# 112993	02-Apr-2003	peter	Commit a partial lazy thread switch mechanism for i386. it isn't as lazy as it could be and can do with some more cleanup. Currently its under options LAZY_SWITCH. What this does is avoid %cr3 reloads for short context switches that do not involve another user process. ie: we can take an interrupt, switch to a kthread and return to the user without explicitly flushing the tlb. However, this isn't as exciting as it could be, the interrupt overhead is still high and too much blocks on Giant still. There are some debug sysctls, for stats and for an on/off switch. The main problem with doing this has been "what if the process that you're running on exits while we're borrowing its address space?" - in this case we use an IPI to give it a kick when we're about to reclaim the pmap. Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a few more things and get some more feedback before turning it on by default. This is NOT a replacement for Bosko's lazy interrupt stuff. This was more meant for the kthread case, while his was for interrupts. Mine helps a little for interrupts, but his helps a lot more. The stats are enabled with options SWTCH_OPTIM_STATS - this has been a pseudo-option for years, I just added a bunch of stuff to it. One non-trivial change was to select a new thread before calling cpu_switch() in the first place. This allows us to catch the silly case of doing a cpu_switch() to the current process. This happens uncomfortably often. This simplifies a bit of the asm code in cpu_switch (no longer have to call choosethread() in the middle). This has been implemented on i386 and (thanks to jake) sparc64. The others will come soon. This is actually seperate to the lazy switch stuff. Glanced at by: jake, jhb
# 112910	31-Mar-2003	jeff	- Borrow the KSE single threading code for exec and exit. We use the check if (p->p_numthreads > 1) and not a flag because action is only necessary if there are other threads. The rest of the system has no need to identify thr threaded processes. - In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED is not set.
# 112397	19-Mar-2003	davidxu	Adjust code for userland preemptive. Userland can set a quantum in kse_mailbox to schedule an upcall, this is useful for userland timeout routine, for example pthread_cond_timedwait(). Also extract upcall scheduling code from kse_reassign and create a new function called thread_switchout to include these code. Reviewed by: julain
# 111883	04-Mar-2003	jhb	Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls to WITNESS_WARN().
# 111605	27-Feb-2003	harti	When a process has been waiting on a condition variable or mutex the td_wmesg field in the thread structure points to the description string of the condition variable or mutex. If the condvar or the mutex had been initialized from a loadable module that was unloaded in the meantime, td_wmesg may now point to invalid memory. Retrieving the process table now may panic the kernel (or access junk). Setting the td_wmesg field to NULL after unblocking on the condvar/mutex prevents this panic. PR: kern/47408 Approved by: jake (mentor)
# 111585	27-Feb-2003	julian	Change the process flags P_KSES to be P_THREADED. This is just a cosmetic change but I've been meaning to do it for about a year.
# 111032	17-Feb-2003	julian	Move a bunch of flags from the KSE to the thread. I was in two minds as to where to put them in the first case.. I should have listenned to the other mind. Submitted by: parts by davidxu@ Reviewed by: jeff@ mini@
# 110858	14-Feb-2003	alfred	Print a backtrace in case we tsleep from inside of DDB.
# 108338	27-Dec-2002	julian	Add code to ddb to allow backtracing an arbitrary thread. (show thread {address}) Remove the IDLE kse state and replace it with a change in the way threads sahre KSEs. Every KSE now has a thread, which is considered its "owner" however a KSE may also be lent to other threads in the same group to allow completion of in-kernel work. n this case the owner remains the same and the KSE will revert to the owner when the other work has been completed. All creations of upcalls etc. is now done from kse_reassign() which in turn is called from mi_switch or thread_exit(). This means that special code can be removed from msleep() and cv_wait(). kse_release() does not leave a KSE with no thread any more but converts the existing thread into teh KSE's owner, and sets it up for doing an upcall. It is just inhibitted from being scheduled until there is some reason to do an upcall. Remove all trace of the kse_idle queue since it is no-longer needed. "Idle" KSEs are now on the loanable queue.
# 107719	10-Dec-2002	julian	Unbreak the KSE code. Keep track of zobie threads using the Per-CPU storage during the context switch. Rearrange thread cleanups to avoid problems with Giant. Clean threads when freed or when recycled. Approved by: re (jhb)
# 107134	21-Nov-2002	jeff	- Move FSCALE back to kern_sync. This is not scheduler specific. - Create a new callout for lbolt and move it out of schedcpu(). This is not scheduler specific either. Approved by: re
# 106180	30-Oct-2002	davidxu	Add an actual implementation of kse_thr_interrupt()
# 105911	25-Oct-2002	julian	More work on the interaction between suspending and sleeping threads. Also clean up some code used with 'single-threading'. Reviewed by: davidxu
# 104964	12-Oct-2002	jeff	- Create a new scheduler api that is defined in sys/sched.h - Begin moving scheduler specific functionality into sched_4bsd.c - Replace direct manipulation of scheduler data with hooks provided by the new api. - Remove KSE specific state modifications and single runq assumptions from kern_switch.c Reviewed by: -arch
# 104719	09-Oct-2002	jhb	- Move p_cpulimit to struct proc from struct plimit and protect it with sched_lock. This means that we no longer access p_limit in mi_switch() and the p_limit pointer can be protected by the proc lock. - Remove PRS_ZOMBIE check from CPU limit test in mi_switch(). PRS_ZOMBIE processes don't call mi_switch(), and even if they did there is no longer the danger of p_limit being NULL (which is what the original zombie check was added for). - When we bump the current processes soft CPU limit in ast(), just bump the private p_cpulimit instead of the shared rlimit. This fixes an XXX for some value of fix. There is still a (probably benign) bug in that this code doesn't check that the new soft limit exceeds the hard limit. Inspired by: bde (2)
# 104695	09-Oct-2002	julian	Round out the facilty for a 'bound' thread to loan out its KSE in specific situations. The owner thread must be blocked, and the borrower can not proceed back to user space with the borrowed KSE. The borrower will return the KSE on the next context switch where teh owner wants it back. This removes a lot of possible race conditions and deadlocks. It is consceivable that the borrower should inherit the priority of the owner too. that's another discussion and would be simple to do. Also, as part of this, the "preallocatd spare thread" is attached to the thread doing a syscall rather than the KSE. This removes the need to lock the scheduler when we want to access it, as it's now "at hand". DDB now shows a lot mor info for threaded proceses though it may need some optimisation to squeeze it all back into 80 chars again. (possible JKH project) Upcalls are now "bound" threads, but "KSE Lending" now means that other completing syscalls can be completed using that KSE before the upcall finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is one of the highest priorities in the KSE system.) The upcall when it happens will present all the completed syscalls to the KSE for selection.
# 104394	03-Oct-2002	jmallett	XXX Add a check for p->p_limit being NULL before dereferencing it. This is totally bogus but will hide the occurances of access of 0xbc(NULL) which people have run into lately. This is not a proper fix, just a bandaid, until the cause of this happening is tracked down and fixed. Reviewed by: rwatson
# 104387	02-Oct-2002	jhb	Rename the mutex thread and process states to use a more generic 'LOCK' name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will be used internally by the turnstile code and will not be specific to mutexes. Making the change now ensures that turnstiles can be dropped in at a later date without affecting the ABI of userland applications.
# 104295	01-Oct-2002	jhb	- Adjust comment noting that handling of CPU limit exhaustion is done in ast(). - Actually set KEF_ASTPENDING so ast() is called. I think this is buggy for a process with multiple KSE's in that PS_XCPU is not a KSE event, it's a process-wide event. IMO there really should probably be two ASTPENDING flags, one for per-process, and one for per-KSE. Submitted by: bde
# 104240	30-Sep-2002	jhb	- Add a new per-process flag PS_XCPU to indicate that at least one thread has exceeded its CPU time limit. - In mi_switch(), set PS_XCPU when the CPU time limit is exceeded. - Perform actual CPU time limit exceeded work in ast() when PS_XCPU is set. Requested by: many
# 104157	29-Sep-2002	julian	Implement basic KSE loaning. This stops a hread that is blocked in BOUND mode from stopping another thread from completing a syscall, and this allows it to release its resources etc. Probably more related commits to follow (at least one I know of) Initial concept by: julian, dillon Submitted by: davidxu
# 103216	11-Sep-2002	julian	Completely redo thread states. Reviewed by: davidxu@freebsd.org
# 102592	29-Aug-2002	julian	Rejig the code to figure out estcpu and work out how long a KSEGRP has been idle. What was there before was surprisingly ALMOST correct. Peter and I fried our brains on this for a couple of hours figuring out what this actually means in the context of multiple threads. Reviewed by: peter@freebsd.org
# 102544	28-Aug-2002	peter	updatepri() works on a ksegrp (where the scheduling parameters are), so directly give it the ksegrp instead of the thread. The only thing it used to use in the thread was the ksegrp. Reviewed by: julian
# 101176	01-Aug-2002	julian	Slight cleanup of some comments/whitespace. Make idle process state more consistant. Add an assert on thread state. Clean up idleproc/mi_switch() interaction. Use a local instead of referencing curthread 7 times in a row (I've been told curthread can be expensive on some architectures) Remove some commented out code. Add a little commented out code (completion coming soon) Reviewed by: jhb@freebsd.org
# 100921	30-Jul-2002	tanimura	In endtsleep() and cv_timedwait_end(), a thread marked TDF_TIMEOUT may be swapped out. Do not put such the thread directly back to the run queue. Spotted by: David Xu <davidx@viasoft.com.cn> While I am here, s/PS_TIMEOUT/TDF_TIMEOUT/.
# 100913	30-Jul-2002	tanimura	- Optimize wakeup() and its friends; if a thread waken up is being swapped in, we do not have to ask for the scheduler thread to do that. - Assert that a process is not swapped out in runq functions and swapout(). - Introduce thread_safetoswapout() for readability. - In swapout_procs(), perform a test that may block (check of a thread working on its vm map) first. This lets us call swapout() with the sched_lock held, providing a better atomicity.
# 100884	29-Jul-2002	julian	Create a new thread state to describe threads that would be ready to run except for the fact tha they are presently swapped out. Also add a process flag to indicate that the process has started the struggle to swap back in. This will be needed for the case where multiple threads start the swapin action top a collision. Also add code to stop a process fropm being swapped out if one of the threads in this process is actually off running on another CPU.. that might hurt... Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>
# 100647	24-Jul-2002	julian	slight stylisations to take into account recent code changes.
# 100262	17-Jul-2002	julian	Fix a reversed test. Fix some style nits. Fix a KASSERT message. Add/fix some comments. Submitted by: bde@freebsd.org
# 100210	17-Jul-2002	jhb	Add a KASSERT() to assert that td_critnest is == 1 when mi_switch() is called.
# 100209	17-Jul-2002	gallatin	Allow alphas to do crashdumps: Refuse to run anything in choosethread() after a panic which is not an interrupt thread, or the thread which caused the panic. Also, remove panicstr checks from msleep() and from cv_wait() in order to allow threads to go to sleep and yeild the cpu to the panicing thread, or to an interrupt thread which might be doing the crashdump. Reviewed by: jhb (and it was mostly his idea too)
# 99942	14-Jul-2002	julian	Thinking about it I came to the conclusion that the KSE states were incorrectly formulated. The correct states should be: IDLE: On the idle KSE list for that KSEG RUNQ: Linked onto the system run queue. THREAD: Attached to a thread and slaved to whatever state the thread is in. This means that most places where we were adjusting kse state can go away as it is just moving around because the thread is.. The only places we need to adjust the KSE state is in transition to and from the idle and run queues. Reviewed by: jhb@freebsd.org
# 99937	13-Jul-2002	julian	oops, state cannot be two different values at once.. use \|\| instead of &&
# 99890	12-Jul-2002	dillon	Re-enable the idle page-zeroing code. Remove all IPIs from the idle page-zeroing code as well as from the general page-zeroing code and use a lazy tlb page invalidation scheme based on a callback made at the end of mi_switch. A number of people came up with this idea at the same time so credit belongs to Peter, John, and Jake as well. Two-way SMP buildworld -j 5 tests (second run, after stabilization) 2282.76 real 2515.17 user 704.22 sys before peter's IPI commit 2266.69 real 2467.50 user 633.77 sys after peter's commit 2232.80 real 2468.99 user 615.89 sys after this commit Reviewed by: peter, jhb Approved by: peter
# 99488	06-Jul-2002	julian	make this repect ps_sigintr if there is a pre-existing signal or suspension request. Submitted by: David Xu
# 99480	06-Jul-2002	julian	Fix at least one of the things wrong with signals ^Z should work a lot better now. Submitted by: peter@freebsd.org
# 99337	03-Jul-2002	julian	Try clean up some of the mess that resulted from layers and layers of p4 merges from -current as things started getting different. Corroborated by: Similar patches just mailed by BDE.
# 99248	02-Jul-2002	julian	When going back to SLEEP state, make sure our State is correctly marked so.
# 99072	29-Jun-2002	julian	Part 1 of KSE-III The ability to schedule multiple threads per process (one one cpu) by making ALL system calls optionally asynchronous. to come: ia64 and power-pc patches, patches for gdb, test program (in tools) Reviewed by: Almost everyone who counts (at various times, peter, jhb, matt, alfred, mini, bernd, and a cast of thousands) NOTE: this is still Beta code, and contains lots of debugging stuff. expect slight instability in signals..
# 99012	29-Jun-2002	alfred	more caddr_t removal.
# 98714	23-Jun-2002	dillon	I Noticed a defect in the way wakeup() scans the tailq. Tor noticed an even worse defect in wakeup_one(). This patch cleans up both. Submitted by: tegge MFC after: 3 days
# 97995	07-Jun-2002	jhb	- Catch up to new ktrace API. - ktrace trace points in msleep() and cv_wait() no longer need Giant.
# 97526	29-May-2002	julian	CURSIG() is not a macro so rename it cursig(). Obtained from: KSE tree
# 97158	23-May-2002	jhb	Minor nit: get p pointer in msleep() from td->td_proc (where td == curthread) rather than from curproc.
# 92723	19-Mar-2002	alfred	Remove __P.
# 92666	19-Mar-2002	peter	Fix a gcc-3.1+ warning. warning: deprecated use of label at end of compound statement ie: you cannot do this anymore: switch(foo) { .... default: }
# 91066	22-Feb-2002	phk	Convert p->p_runtime and PCPU(switchtime) to bintime format.
# 90538	11-Feb-2002	julian	In a threaded world, differnt priorirites become properties of different entities. Make it so. Reviewed by: jhb@freebsd.org (john baldwin)
# 88900	05-Jan-2002	jhb	Change the preemption code for software interrupt thread schedules and mutex releases to not require flags for the cases when preemption is not allowed: The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent switching to a higher priority thread on mutex releease and swi schedule, respectively when that switch is not safe. Now that the critical section API maintains a per-thread nesting count, the kernel can easily check whether or not it should switch without relying on flags from the programmer. This fixes a few bugs in that all current callers of swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from fast interrupt handlers and the swi_sched of softclock needed this flag. Note that to ensure that swi_sched()'s in clock and fast interrupt handlers do not switch, these handlers have to be explicitly wrapped in critical_enter/exit pairs. Presently, just wrapping the handlers is sufficient, but in the future with the fully preemptive kernel, the interrupt must be EOI'd before critical_exit() is called. (critical_exit() can switch due to a deferred preemption in a fully preemptive kernel.) I've tested the changes to the interrupt code on i386 and alpha. I have not tested ia64, but the interrupt code is almost identical to the alpha code, so I expect it will work fine. PowerPC and ARM do not yet have interrupt code in the tree so they shouldn't be broken. Sparc64 is broken, but that's been ok'd by jake and tmm who will be fixing the interrupt code for sparc64 shortly. Reviewed by: peter Tested on: i386, alpha
# 88088	17-Dec-2001	jhb	Modify the critical section API as follows: - The MD functions critical_enter/exit are renamed to start with a cpu_ prefix. - MI wrapper functions critical_enter/exit maintain a per-thread nesting count and a per-thread critical section saved state set when entering a critical section while at nesting level 0 and restored when exiting to nesting level 0. This moves the saved state out of spin mutexes so that interlocking spin mutexes works properly. - Most low-level MD code that used critical_enter/exit now use cpu_critical_enter/exit. MI code such as device drivers and spin mutexes use the MI wrappers. Note that since the MI wrappers store the state in the current thread, they do not have any return values or arguments. - mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is assigned to curthread->td_savecrit during fork_exit(). Tested on: i386, alpha
# 88019	16-Dec-2001	luigi	Add/correct description for some sysctl variables where it was missing. The description field is unused in -stable, so the MFC there is equivalent to a comment. It can be done at any time, i am just setting a reminder in 45 days when hopefully we are past 4.5-release. MFC after: 45 days
# 85368	23-Oct-2001	jhb	Assert that Giant is not held in mi_switch() unless the process state is SMTX or SRUN.
# 85237	20-Oct-2001	iedowse	Introduce some jitter to the timing of the samples that determine the system load average. Previously, the load average measurement was susceptible to synchronisation with processes that run at regular intervals such as the system bufdaemon process. Each interval is now chosen at random within the range of 4 to 6 seconds. This large variation is chosen so that over the shorter 5-minute load average timescale there is a good dispersion of samples across the 5-second sample period (the time to perform 60 5-second samples now has a standard deviation of approx 4.5 seconds).
# 85227	20-Oct-2001	iedowse	Move the code that computes the system load average from vm_meter.c to kern_synch.c in preparation for adding some jitter to the inter-sample time. Note that the "vm.loadavg" sysctl still lives in vm_meter.c which isn't the right place, but it is appropriate for the current (bad) name of that sysctl. Suggested by: jhb (some time ago) Reviewed by: bde
# 83787	21-Sep-2001	jhb	GC some #if 0'd code.
# 83786	21-Sep-2001	jhb	Whitespace and spelling fixes.
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 82711	01-Sep-2001	dillon	Make yield() MPSAFE. Synchronize syscalls.master with all MPSAFE changes to date. Synchronize new syscall generation follows because yield() will panic if it is out of sync with syscalls.master.
# 82117	21-Aug-2001	jhb	Release the sched_lock before bombing out in mi_switch() via db_error(). This makes things slightly easier if you call a function that calls mi_switch() as it keeps the locking before and after closer.
# 82096	21-Aug-2001	jhb	Add a hook to mi_switch() to abort via db_error() if we attempt to perform a context switch from DDB. Consulting from: bde
# 82085	21-Aug-2001	jhb	- Fix a bug in the previous workaround for the tsleep/endtsleep race. callout_stop() would fail in two cases: 1) The timeout was currently executing, and 2) The timeout had already executed. We only needed to work around the race for 1). We caught some instances of 2) via the PS_TIMEOUT flag, however, if endtsleep() fired after the process had been woken up but before it had resumed execution, PS_TIMEOUT would not be set, but callout_stop() would fail, so we would block the process until endtsleep() resumed it. Except that endtsleep() had already run and couldn't resume it. This adds a new flag PS_TIMOFAIL to indicate the case of 2) when PS_TIMEOUT isn't set. - Implement this race fix for condition variables as well. Tested by: sos
# 81493	10-Aug-2001	jhb	- Close races with signals and other AST's being triggered while we are in the process of exiting the kernel. The ast() function now loops as long as the PS_ASTPENDING or PS_NEEDRESCHED flags are set. It returns with preemption disabled so that any further AST's that arrive via an interrupt will be delayed until the low-level MD code returns to user mode. - Use u_int's to store the tick counts for profiling purposes so that we do not need sched_lock just to read p_sticks. This also closes a problem where the call to addupc_task() could screw up the arithmetic due to non-atomic reads of p_sticks. - Axe need_proftick(), aston(), astoff(), astpending(), need_resched(), clear_resched(), and resched_wanted() in favor of direct bit operations on p_sflag. - Fix up locking with sched_lock some. In addupc_intr(), use sched_lock to ensure pr_addr and pr_ticks are updated atomically with setting PS_OWEUPC. In ast() we clear pr_ticks atomically with clearing PS_OWEUPC. We also do not grab the lock just to test a flag. - Simplify the handling of Giant in ast() slightly. Reviewed by: bde (mostly)
# 81482	10-Aug-2001	jhb	Work around a race between msleep() and endtsleep() where it was possible for endtsleep() to be executing when msleep() resumed, for endtsleep() to spin on sched_lock long enough for the other process to loop on msleep() and sleep again resulting in endtsleep() waking up the "wrong" msleep. Obtained from: BSD/OS
# 81479	10-Aug-2001	jhb	Style nit: covert a couple of if (p_wchan) tests to if (p_wchan != NULL).
# 81397	10-Aug-2001	jhb	- Remove asleep(), await(), and M_ASLEEP. - Callers of asleep() and await() have been converted to calling tsleep(). The only caller outside of M_ASLEEP was the ata driver, which called both asleep() and await() with spl-raised, so there was no need for the asleep() and await() pair. M_ASLEEP was unused. Reviewed by: jasone, peter
# 81072	02-Aug-2001	jhb	Use 'p' instead of the potentially more expensive 'curproc' inside of mi_switch().
# 80766	31-Jul-2001	jhb	Apply the cluebat to myself and undo the await() -> mawait() rename. The asleep() and await() functions split the functionality of msleep() up into two halves. Only the asleep() half (which is what puts the process on the sleep queue) actually needs the lock usually passed to msleep() held to prevent lost wakeups. await() does not need the lock held, so the lock can be released prior to calling await() and does not need to be passed in to the await() function. Typical usage of these functions would be as follows: mtx_lock(&foo_mtx); ... do stuff ... asleep(&foo_cond, PRIxx, "foowt", hz); ... mtx_unlock&foo_mtx); ... await(-1, -1); Inspired by: dillon on the couch at Usenix
# 80761	31-Jul-2001	jhb	Add a safety belt to mawait() for the (cold \|\| panicstr) case identical to the one in msleep() such that we return immediately rather than blocking. Submitted by: peter Prodded by: sheldonh
# 79343	05-Jul-2001	jake	Backout mwakeup, etc.
# 79172	03-Jul-2001	jake	Implement mwakeup, mwakeup_one, cv_signal_drop and cv_broadcast_drop. These take an additional mutex argument, which is dropped before any processes are made runnable. This can avoid contention on the mutex if the processes would immediately acquire it, and is done in such a way that wakeups will not be lost. Reviewed by: jhb
# 79132	03-Jul-2001	jhb	Remove commented-out garbage that skipped updating schedcpu() stats for ithreads in SWAIT.
# 79131	03-Jul-2001	jhb	Just check p_oncpu when determining if a process is executing or not. We already did this in the SMP case, and it is now maintained in the UP case as well, and makes the code slightly more readable. Note that curproc is always executing, thus the p != curproc test does not need to be performed if the p_oncpu check is made.
# 79130	03-Jul-2001	jhb	Axe spl's that are covered by the sched_lock (and have been for quite some time.)
# 79128	03-Jul-2001	jhb	Include the wait message and channel for msleep() in the KTR tracepoint.
# 79126	03-Jul-2001	jhb	Remove bogus need_resched() of the current CPU in roundrobin(). We don't actually need to force a context switch of the current process. The act of firing the event triggers a context switch to softclock() and then switching back out again which is equivalent to a preemption, thus no further work is needed on the local CPU.
# 79003	30-Jun-2001	jhb	Make the schedlock saved critical section state a per-thread property.
# 78638	22-Jun-2001	jhb	- Lock CURSIG() with the proc lock to close the signal race with psignal. - Grab Giant around ktrace points. - Clean up KTR_PROC tracepoints to not display the value of sched_lock.mtx_lock as it isn't really needed anymore and just obfuscates the messages. - Add a few if conditions to replace gotos. - Ensure that every msleep KTR event ends up with a matching msleep resume KTR event (this was broken when we didn't do a mi_switch()). - Only note via ktrace that we resumed from a switch once rather than twice in several places in msleep(). - Remove spl's rom asleep and await as the proc lock and sched_lock provide all the needed locking. - In mawait() add in a needed ktrace point for noting that we are about to switch out.
# 77059	23-May-2001	jhb	Add in assertions to ensure that we always call msleep or mawait with either a timeout or a held mutex to detect unprotected infinite sleeps that can easily lead to deadlock. Submitted by: alfred
# 76950	21-May-2001	alfred	Remove KASSERT test for sleeping on mv_mtx, instead let WITNESS catch it. Requested by: jhb
# 76830	18-May-2001	alfred	remove my private assertions from tsleep. add one assertion to ensure we don't sleep while holding vm.
# 76827	18-May-2001	alfred	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 76647	15-May-2001	jhb	- Remove unneeded include of sys/ipl.h. - Lock the process before calling killproc() to kill it for exceeding the maximum CPU limit.
# 76078	27-Apr-2001	jhb	Overhaul of the SMP code. Several portions of the SMP kernel support have been made machine independent and various other adjustments have been made to support Alpha SMP. - It splits the per-process portions of hardclock() and statclock() off into hardclock_process() and statclock_process() respectively. hardclock() and statclock() call the _process() functions for the current process so that UP systems will run as before. For SMP systems, it is simply necessary to ensure that all other processors execute the _process() functions when the main clock functions are triggered on one CPU by an interrupt. For the alpha 4100, clock interrupts are delievered in a staggered broadcast fashion, so we simply call hardclock/statclock on the boot CPU and call the _process() functions on the secondaries. For x86, we call statclock and hardclock as usual and then call forward_hardclock/statclock in the MD code to send an IPI to cause the AP's to execute forwared_hardclock/statclock which then call the _process() functions. - forward_signal() and forward_roundrobin() have been reworked to be MI and to involve less hackery. Now the cpu doing the forward sets any flags, etc. and sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically return so that they can execute ast() and don't bother with setting the astpending or needresched flags themselves. This also removes the loop in forward_signal() as sched_lock closes the race condition that the loop worked around. - need_resched(), resched_wanted() and clear_resched() have been changed to take a process to act on rather than assuming curproc so that they can be used to implement forward_roundrobin() as described above. - Various other SMP variables have been moved to a MI subr_smp.c and a new header sys/smp.h declares MI SMP variables and API's. The IPI API's from machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h. - The globaldata_register() and globaldata_find() functions as well as the SLIST of globaldata structures has become MI and moved into subr_smp.c. Also, the globaldata list is only available if SMP support is compiled in. Reviewed by: jake, peter Looked over by: eivind
# 74927	28-Mar-2001	jhb	Convert the allproc and proctree locks from lockmgr locks to sx locks.
# 74912	28-Mar-2001	jhb	Rework the witness code to work with sx locks as well as mutexes. - Introduce lock classes and lock objects. Each lock class specifies a name and set of flags (or properties) shared by all locks of a given type. Currently there are three lock classes: spin mutexes, sleep mutexes, and sx locks. A lock object specifies properties of an additional lock along with a lock name and all of the extra stuff needed to make witness work with a given lock. This abstract lock stuff is defined in sys/lock.h. The lockmgr constants, types, and prototypes have been moved to sys/lockmgr.h. For temporary backwards compatability, sys/lock.h includes sys/lockmgr.h. - Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin locks held. By making this per-cpu, we do not have to jump through magic hoops to deal with sched_lock changing ownership during context switches. - Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with proc->p_sleeplocks, which is a list of held sleep locks including sleep mutexes and sx locks. - Add helper macros for logging lock events via the KTR_LOCK KTR logging level so that the log messages are consistent. - Add some new flags that can be passed to mtx_init(): - MTX_NOWITNESS - specifies that this lock should be ignored by witness. This is used for the mutex that blocks a sx lock for example. - MTX_QUIET - this is not new, but you can pass this to mtx_init() now and no events will be logged for this lock, so that one doesn't have to change all the individual mtx_lock/unlock() operations. - All lock objects maintain an initialized flag. Use this flag to export a mtx_initialized() macro that can be safely called from drivers. Also, we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness performs the corresponding checks using the initialized flag. - The lock order reversal messages have been improved to output slightly more accurate file and line numbers.
# 73915	07-Mar-2001	jhb	- Proc locking. - Remove unneeded spl()'s.
# 72886	22-Feb-2001	jhb	Add a mtx_assert() in maybe_resched() just to be sure it's always called with sched_lock held.
# 72828	21-Feb-2001	jhb	- Use the new NOCPU constant. - Fix a warning. Noticed by: bde (2)
# 72746	20-Feb-2001	jhb	- Don't call clear_resched() in userret(), instead, clear the resched flag in mi_switch() just before calling cpu_switch() so that the first switch after a resched request will satisfy the request. - While I'm at it, move a few things into mi_switch() and out of cpu_switch(), specifically set the p_oncpu and p_lastcpu members of proc in mi_switch(), and handle the sched_lock state change across a context switch in mi_switch(). - Since cpu_switch() no longer handles the sched_lock state change, we have to setup an initial state for sched_lock in fork_exit() before we release it.
# 72376	11-Feb-2001	jake	Implement a unified run queue and adjust priority levels accordingly. - All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propogated up into interrupt thread range if need be. - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels. - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure. - Group the priority level, user priority, native priority (before propogation) and the scheduling class into a struct priority. - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI). - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt. - Activate propogate_priority(). It should now have the desired effect without needing to also propogate the scheduling class. - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propogate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at apprioriate times by the vm system. - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
# 72330	10-Feb-2001	jake	Acquire sched_lock around need_resched() in roundrobin() to satisfy assertions that it is held. Since roundrobin() is a timeout there's no possible way that it could be called with sched_lock held.
# 72200	09-Feb-2001	bmilekic	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 71858	31-Jan-2001	peter	Zap last remaining references to (and a use use of) of simple_locks.
# 71564	24-Jan-2001	jhb	- Catch up to proc flag changes. - Add in some locking ops that might fix SIGXCPU, but don't enable them yet. - Assert that sched_lock is not recursed when mi_switch() is called.
# 71436	23-Jan-2001	mjacob	Do not do the commenting out the way that saves bytes and looks cleaner to you. Do it the way Vox Populi wants it.
# 71391	22-Jan-2001	mjacob	Move (now) unused variable declaration inside the block (now commented out).
# 71288	20-Jan-2001	jhb	Temporarily disable the printf() for micruptime() going backwards, the SIGXCPU signal, and killing of processes that exceed their allowed run time until they can play nice with sched_lock. Right now they are just potentital panics waiting to happen. The printf() has bitten several people.
# 71088	15-Jan-2001	jasone	Implement condition variables.
# 70861	10-Jan-2001	jake	Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables other then curproc.
# 69947	12-Dec-2000	jake	- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead of explicit calls to lockmgr. Also provides macros for the flags pased to specify shared, exclusive or release which map to the lockmgr flags. This is so that the use of lockmgr can be easily replaced with optimized reader-writer locks. - Add some locking that I missed the first time.
# 69645	05-Dec-2000	jhb	Add in #include of <sys/lock.h> since it was axed from <sys/proc.h>. Noticed by: Wesley Morgan <morganw@chemikals.org> Pointy hat to: me
# 69512	02-Dec-2000	jake	Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup from struct proc, which are now unused (p_nthread already was). Remove process flag P_KTHREADP which was untested and only set in vfs_aio.c (it should use kthread_create). Move the yield system call to kern_synch.c as kern_threads.c has been removed completely. moral support from: alfred, jhb
# 69437	01-Dec-2000	jake	Use an mp-safe callout for endtsleep.
# 69376	29-Nov-2000	jhb	Fix up priority propagation: - Use a better test for determining when a process is running. - Convert some checks to assertions. - Remove unnecessary tests. - Save the priority before acquiring a mutex rather than in msleep(9).
# 69360	29-Nov-2000	jhb	Don't drop Giant and the passed in mutex incorrectly in the cold \|\| panicstr case. Do drop the passed in mutex in that case if PDROP is specified.
# 69286	27-Nov-2000	jake	Use callout_reset instead of timeout(9). Most callouts are statically allocated, 2 have been added to struct proc for setitimer and sleep. Reviewed by: jhb, jlemon
# 69022	22-Nov-2000	jake	Protect the following with a lockmgr lock: allproc zombproc pidhashtbl proc.p_list proc.p_hash nextpid Reviewed by: jhb Obtained from: BSD/OS and netbsd
# 68862	17-Nov-2000	jake	- Split the run queue and sleep queue linkage, so that a process may block on a mutex while on the sleep queue without corrupting it. - Move dropping of Giant to after the acquire of sched_lock. Tested by: John Hay <jhay@icomtek.csir.co.za> jhb
# 68808	16-Nov-2000	jhb	Don't release and acquire Giant in mi_switch(). Instead, release and acquire Giant as needed in functions that call mi_switch(). The releases need to be done outside of the sched_lock to avoid potential deadlocks from trying to acquire Giant while interrupts are disabled. Submitted by: witness
# 68805	15-Nov-2000	jhb	Argh, add in a missing release of the sched_lock.
# 68804	15-Nov-2000	jhb	CURSIG() calls functions that acquire sleep mutexes, so it is not a good idea to be holding the sched_lock while we are calling it. As such, release sched_lock before calling CURSIG() in msleep() and mawait() and reacquire it after CURSIG() returns. Submitted by: witness
# 68794	15-Nov-2000	jhb	- Rename await() to mawait(). mawait() is to await() as msleep() is to tsleep(). Namely, mawait() takes an extra argument which is a mutex to drop when going to sleep. Just as with msleep(), if the priority argument includes the PDROP flag, then the mutex will be dropped and will not be reacquired when the process wakes up. - Add in a backwards compatible macro await() that passes in NULL as the mutex argument to mawait().
# 68793	15-Nov-2000	jhb	- Replace a KASSERT() that knew too much about mutex internals with a mtx_assert() that ensures the mutex we release during msleep() is both not recursed and owned by the current process.
# 68792	15-Nov-2000	jhb	- Convert references from tsleep() -> msleep() - Fix a buglet in a comment above await()
# 67362	20-Oct-2000	jhb	- GC some #if 0'd code regarding the non-existant safepri variable. - Don't dink with the witness state of Giant unless we actually own it during mi_switch().
# 66716	06-Oct-2000	jhb	- Change fast interrupts on x86 to push a full interrupt frame and to return through doreti to handle ast's. This is necessary for the clock interrupts to work properly. - Change the clock interrupts on the x86 to be fast instead of threaded. This is needed because both hardclock() and statclock() need to run in the context of the current process, not in a separate thread context. - Kill the prevproc hack as it is no longer needed. - We really need Giant when we call psignal(), but we don't want to block during the clock interrupt. Instead, use two p_flag's in the proc struct to mark the current process as having a pending SIGVTALRM or a SIGPROF and let them be delivered during ast() when hardclock() has finished running. - Remove CLKF_BASEPRI, which was #ifdef'd out on the x86 anyways. It was broken on the x86 if it was turned on since cpl is gone. It's only use was to bogusly run softclock() directly during hardclock() rather than scheduling an SWI. - Remove the COM_LOCK simplelock and replace it with a clock_lock spin mutex. Since the spin mutex already handles disabling/restoring interrupts appropriately, this also lets us axe all the *_intr() fu. - Back out the hacks in the APIC_IO x86 cpu_initclocks() code to use temporary fast interrupts for the APIC trial. - Add two new process flags P_ALRMPEND and P_PROFPEND to mark the pending signals in hardclock() that are to be delivered in ast(). Submitted by: jakeb (making statclock safe in a fast interrupt) Submitted by: cp (concept of delaying signals until ast())
# 66314	23-Sep-2000	jhb	Add a KASSERT() to catch instances where the mutex that we pass in to msleep() are recursed. Suggested by: cp
# 65856	14-Sep-2000	jhb	Remove the mtx_t, witness_t, and witness_blessed_t types. Instead, just use struct mtx, struct witness, and struct witness_blessed. Requested by: bde
# 65708	10-Sep-2000	jake	Rename tsleep to msleep and add a mutex argument, which is released before sleeping and re-acquired before msleep returns. A compatibility cpp macro has been provided for tsleep to avoid changing all occurences of it in the kernel. Remove an assertion that the Giant mutex be held before calling tsleep or asleep. This is intended to serve the same purpose as condition variables, but does not preclude their addition in the future. Approved by: jasone Obtained from: BSD/OS
# 65682	10-Sep-2000	dfr	Fix printf warnings in CTRx calls.
# 65557	06-Sep-2000	jasone	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# 62573	04-Jul-2000	phk	Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
# 62454	03-Jul-2000	phk	Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
# 60938	26-May-2000	jake	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 60833	23-May-2000	jake	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 60120	07-May-2000	grog	Correct a couple of typos.
# 59792	30-Apr-2000	green	Change the scheduler to actually respect the PUSER barrier. It's been wrong for many years that negative niceness would lower the priority of a process below PUSER, and once below PUSER, there were conditionals in the code that are required to test for whether a process was in the kernel which would break. The breakage could (and did) cause lock-ups, basically nothing else but the least nice program being able to run in some conditions. The algorithm which adjusts the priority now subtracts PRIO_MIN to do things properly, and the ESTCPULIM() algorithm was updated to use PRIO_TOTAL (PRIO_MAX - PRIO_MIN) to calculate the estcpu. NICE_WEIGHT is now 1 to accomodate the full range of priorities better (a -20 process with full CPU time has the priority of a +0 process with no CPU time). There are now 20 queues (exactly; 80 priorities) for use in user processes' scheduling, and PUSER has been lowered to 48 to accomplish this. This means, to the user, that things will be scheduled more correctly (noticeable), there is no lock-up anymore WRT a niced -20 process never releasing the CPU time for other processes. In this fair system, tsleep()ed < PUSER processes now will get the proper higher priority than priority >= PUSER user processes. The detective work of this was done by me, along with part of the solution. Luoqi Chen has provided most of the solution, and really helped me understand what was happening better, to boot :) Submitted by: luoqi Concept reviewed by: bde
# 58755	28-Mar-2000	dillon	The SMP cleanup commit broke UP compiles. Make UP compiles work again.
# 58717	28-Mar-2000	dillon	Commit major SMP cleanups and move the BGL (big giant lock) in the syscall path inward. A system call may select whether it needs the MP lock or not (the default being that it does need it). A great deal of conditional SMP code for various deadended experiments has been removed. 'cil' and 'cml' have been removed entirely, and the locking around the cpl has been removed. The conditional separately-locked fast-interrupt code has been removed, meaning that interrupts must hold the CPL now (but they pretty much had to anyway). Another reason for doing this is that the original separate-lock for interrupts just doesn't apply to the interrupt thread mechanism being contemplated. Modifications to the cpl may now ONLY occur while holding the MP lock. For example, if an otherwise MP safe syscall needs to mess with the cpl, it must hold the MP lock for the duration and must (as usual) save/restore the cpl in a nested fashion. This is precursor work for the real meat coming later: avoiding having to hold the MP lock for common syscalls and I/O's and interrupt threads. It is expected that the spl mechanisms and new interrupt threading mechanisms will be able to run in tandem, allowing a slow piecemeal transition to occur. This patch should result in a moderate performance improvement due to the considerable amount of code that has been removed from the critical path, especially the simplification of the spl*() calls. The real performance gains will come later. Approved by: jkh Reviewed by: current, bde (exception.s) Some work taken from: luoqi's patch
# 57704	02-Mar-2000	dufault	I applied the wrong patch set. Back out anything associated with the known bogus currtpriority. This undoes the previous changes to sys/i386/i386/trap.c, sys/alpha/alpha/trap.c, sys/sys/systm.h Now we have the patch set approved by bde. Approved by: bde
# 57701	02-Mar-2000	dufault	Patches that eliminate extra context switches in FIFO case. Fixes p1003_1b regression test in the simple case of no RR and FIFO processes competing. Reviewed by: jkh, bde
# 53944	30-Nov-1999	peter	Don't make the ktrace hook in tsleep() deref a null curproc after a panic. PR: 15169 Submitted by: David Gilbert <dgilbert@velocet.ca>
# 53880	29-Nov-1999	phk	Add a bit of sanity checking and problem avoidance in case the timecounter hardware is bogus. This will produce a new warning "microuptime() went backwards" and try to not screw up the process resource accounting.
# 53822	28-Nov-1999	bde	Scheduler fixes equivalent to the ones logged in the following NetBSD commit to kern_synch.c: ---------------------------- revision 1.55 date: 1999/02/23 02:56:03; author: ross; state: Exp; lines: +39 -10 Scheduler bug fixes and reorganization * fix the ancient nice(1) bug, where nice +20 processes incorrectly steal 10 - 20% of the CPU, (or even more depending on load average) * provide a new schedclk() mechanism at a new clock at schedhz, so high platform hz values don't cause nice +0 processes to look like they are niced * change the algorithm slightly, and reorganize the code a lot * fix percent-CPU calculation bugs, and eliminate some no-op code === nice bug === Correctly divide the scheduler queues between niced and compute-bound processes. The current nice weight of two (sort of, see `algorithm change' below) neatly divides the USRPRI queues in half; this should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op, and it was done after decay_cpu() which can only _reduce_ the value. It has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can scheduler-penalize themselves onto the same queue as nice +20 processes. (Or even a higher one.) === New schedclk() mechansism === Some platforms should be cutting down stathz before hitting the scheduler, since the scheduler algorithm only works right in the vicinity of 64 Hz. Rather than prescale hz, then scale back and forth by 4 every time p_estcpu is touched (each occurance an abstraction violation), use p_estcpu without scaling and require schedhz to be generated directly at the right frequency. Use a default stathz (well, actually, profhz) / 4, so nothing changes unless a platform defines schedhz and a new clock. Define these for alpha, where hz==1024, and nice was totally broke. === Algorithm change === The nice value used to be added to the exponentially-decayed scheduler history value p_estcpu, in _addition_ to be incorporated directly (with greater wieght) into the priority calculation. At first glance, it appears to be a pointless increase of 1/8 the nice effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that because it will ramp up linearly but be decayed only exponentially, thus converging to an additional .75 nice for a loadaverage of one. I killed this, it makes the behavior hard to control, almost impossible to analyze, and the effect (~~nothing at for the first second, then somewhat increased niceness after three seconds or more, depending on load average) pointless. === Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calcuation. Collect scheduler functionality. Try to put each abstraction in just one place. ---------------------------- The details are a little different in FreeBSD: === nice bug === Fixing this is the main point of this commit. We use essentially the same clipping rule as NetBSD (our limit on p_estcpu differs by a scale factor). However, clipping at all is fundamentally bad. It gives free CPU the hoggiest hogs once they reach the limit, and reaching the limit is normal for long-running hogs. This will be fixed later. === New schedclk() mechanism === We don't use the NetBSD schedclk() (now schedclock()) mechanism. We require (real)stathz to be about 128 and scale by an extra factor of 2 compared with NetBSD's statclock(). We scale p_estcpu instead of scaling the clock. This is more accurate and flexible. === Algorithm change === Same change. === Other bugs === The p_pctcpu bug was fixed long ago. We don't try as hard to abstract functionality yet. Related changes: the new limit on p_estcpu must be exported to kern_exit.c for clipping in wait1(). Agreed with by: dufault
# 53752	27-Nov-1999	bde	Updated comments for the move in the previous commit.
# 53745	27-Nov-1999	bde	Moved scheduling-related code to kern_synch.c so that it is easier to fix and extend. The new function containing the code is named schedclock() as in NetBSD, but it has slightly different semantics (it already handles incrementation of p->p_cpticks, and it should handle any calling frequency). Agreed with in principle by: dufault
# 53212	16-Nov-1999	phk	This is a partial commit of the patch from PR 14914: Alot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 51791	29-Sep-1999	marcel	sigset_t change (part 2 of 5) ----------------------------- The core of the signalling code has been rewritten to operate on the new sigset_t. No methodological changes have been made. Most references to a sigset_t object are through macros (see signalvar.h) to create a level of abstraction and to provide a basis for further improvements. The NSIG constant has not been changed to reflect the maximum number of signals possible. The reason is that it breaks programs (especially shells) which assume that all signals have a non-null name in sys_signame. See src/bin/sh/trap.c for an example. Instead _SIG_MAXSIG has been introduced to hold the maximum signal possible with the new sigset_t. struct sigprop has been moved from signalvar.h to kern_sig.c because a) it is only used there, and b) access must be done though function sigprop(). The latter because the table doesn't holds properties for all signals, but only for the first NSIG signals. signal.h has been reorganized to make reading easier and to add the new and/or modified structures. The "old" structures are moved to signalvar.h to prevent namespace polution. Especially the coda filesystem suffers from the change, because it contained lines like (p->p_sigmask == SIGIO), which is easy to do for integral types, but not for compound types. NOTE: kdump (and port linux_kdump) must be recompiled. Thanks to Garrett Wollman and Daniel Eischen for pressing the importance of changing sigreturn as well.
# 50477	27-Aug-1999	peter	$Id$ -> $FreeBSD$
# 50033	18-Aug-1999	peter	Don't initialize run queues here, do it all in one place.
# 44487	05-Mar-1999	bde	The magic "no-cpu" cpu number is 0xff. Don't misrepresent cpu numbers as chars or use bogus casts in an attempt to unmisrepresnt them. In top, don't assume that 0xff is the only negative cpu number when cpu numbers are (mis)represented.
# 44452	03-Mar-1999	julian	The tunable parameter for the scheduler quantum was inverted. Higher numbers led to smaller quanta. In discussion with BDE, change this parameter to be in uSecs to make it machine independent, and limit it to non zero multiples of 'tick' (rounding down). Also make the variabel globally available so that the present function that returns its value (used for posix scheduling I believe) can go away. Submitted by: Bruce Evans <bde@freebsd.org>
# 44327	28-Feb-1999	bde	Removed all traces of `p_switchtime'. The relevant timestamp is per-cpu, not per-process. Keep it in `switchtime' consistently. It is now clear that the timestamp is always valid in fork_trampoline() except when the child is running on a previously idle cpu, which can only happen if there are multiple cpus, so don't check or set the timestamp in fork_trampoline except in the (i386) SMP case. Just remove the alpha code for setting it unconditionally, since there is no SMP case for alpha and the code had rotted. Parts reviewed by: dfr, phk
# 44218	22-Feb-1999	bde	Improved scheduling in uiomove(), etc. resched_wanted() is true too often for it to be a good criterion for switching kernel cpu hogs -- it is true after most wakeups. Use the criterion "has been running for >= 2 quanta" instead.
# 42453	09-Jan-1999	eivind	KNFize, by bde.
# 42408	08-Jan-1999	eivind	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# 41971	21-Dec-1998	dillon	Add asleep() and await() support. Currently highly experimental. A small support structure had to be added to the proc structure, and a few minor conditional panics no longer apply.
# 41373	27-Nov-1998	dg	Compare p_cpulimit with RLIM_INFINITY before comparing it with the process runtime. p_runtime is unsigned while p_cpulimit is not, so this avoids the nasty side effect of the process getting killed when the runtime comes up "negative" due to other bugs.
# 41361	26-Nov-1998	bde	Fixed the previous fix - stathz doesn't give the statclock frequency when it is 0. Submitted by: mostly by Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
# 41360	26-Nov-1998	bde	Oops, yet again back out some local changes that shouldn't have been in the previous commit.
# 41359	26-Nov-1998	bde	Fixed scaling of p_pctcpu. It was wrong by a factor of stathz/hz. Until recently, this was half compensated for in at least ps and top by multiplying by 100/stathz to get a better wrong factor of 100/hz.
# 40653	25-Oct-1998	bde	Oops, back out some local changes that shouldn't have been in the previous commit.
# 40652	25-Oct-1998	bde	Fixed breakage of the !SMP case of roundrobin() in the previous commit.
# 40648	25-Oct-1998	phk	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# 38551	26-Aug-1998	dillon	priority comparison in maybe_resched() didn't work properly if current and chk process were on different scheduler queues. Fixed.
# 37649	15-Jul-1998	bde	Cast pointers to uintptr_t/intptr_t instead of to u_long/long, respectively. Most of the longs should probably have been u_longs, but this changes is just to prevent warnings about casts between pointers and integers of different sizes, not to fix poorly chosen types.
# 37565	11-Jul-1998	bde	Moved definition of fscale from param.c to kern_synch.c where it should always have been (it has no user-servicable parts even at compile time) and staticized it.
# 37315	30-Jun-1998	phk	Add 3 sysctl variables for future use by ps)1_
# 37101	21-Jun-1998	bde	Removed unused includes.
# 36441	28-May-1998	phk	Some cleanups related to timecounters and weird ifdefs in <sys/time.h>. Clean up (or if antipodic: down) some of the msgbuf stuff. Use an inline function rather than a macro for timecounter delta. Maintain process "on-cpu" time as 64 bits of microseconds to avoid needless second rollover overhead. Avoid calling microuptime the second time in mi_switch() if we do not pass through _idle in cpu_switch() This should reduce our context-switch overhead a bit, in particular on pre-P5 and SMP systems. WARNING: Programs which muck about with struct proc in userland will have to be fixed. Reviewed, but found imperfect by: bde
# 36135	17-May-1998	tegge	Add forwarding of roundrobin to other cpus. This gives a more regular update of cpu usage as shown by top when one process is cpu bound (no system calls) while the system is otherwise idle (except for top). Don't attempt to switch to the BSP in boot(). If the system was idle when an interrupt caused a panic, this won't work. Instead, switch to the BSP in cpu_reset. Remove some spurious forward_statclock/forward_hardclock warnings.
# 36119	17-May-1998	phk	s/nanoruntime/nanouptime/g s/microruntime/microuptime/g Reviewed by: bde
# 35029	04-Apr-1998	phk	Time changes mark 2: * Figure out UTC relative to boottime. Four new functions provide time relative to boottime. * move "runtime" into struct proc. This helps fix the calcru() problem in SMP. * kill mono_time. * add timespec{add\|sub\|cmp} macros to time.h. (XXX: These may change!) * nanosleep, select & poll takes long sleeps one day at a time Reviewed by: bde Tested by: ache and others
# 34931	28-Mar-1998	dufault	Remove duplicate comment
# 34925	28-Mar-1998	dufault	Finish _POSIX_PRIORITY_SCHEDULING. Needs P1003_1B and _KPOSIX_PRIORITY_SCHEDULING options to work. Changes: Change all "posix4" to "p1003_1b". Misnamed files are left as "posix4" until I'm told if I can simply delete them and add new ones; Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux; Add man pages for _POSIX_PRIORITY_SCHEDULING system calls; Add options to LINT; Minor fixes to P1003_1B code during testing.
# 34924	28-Mar-1998	bde	Moved some #includes from <sys/param.h> nearer to where they are actually used.
# 34489	11-Mar-1998	dufault	idprio processes must be preempted as soon as anything is runnable.
# 34266	08-Mar-1998	julian	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
# 34029	04-Mar-1998	dufault	Reviewed by: msmith, bde long ago Fix for RTPRIO scheduler to eliminate invalid context switches. POSIX.4 headers and sysctl variables. Nothing should change unless POSIX4 is defined or _POSIX_VERSION is set to 199309.
# 33823	25-Feb-1998	bde	Fixed a missing newline in a debugging printf. Fixed punctuation in some comments.
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 32081	29-Dec-1997	bde	Fixed style bugs in previous commit.
# 32071	28-Dec-1997	dyson	Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# 31403	25-Nov-1997	julian	Shift a few SYSINT() calls around. this results in a few functions becoming static, and the SYSINITs being close to the code they are related to. setting up the dump device is with dumpsys() and kicking off the scheduler is with the scheduler. Mounting root is with the code that does it. Reviewed by: phk
# 31352	22-Nov-1997	bde	Staticized.
# 31333	21-Nov-1997	bde	Const poisoning from ks_shortdesc.
# 29680	21-Sep-1997	gibbs	init_main.c subr_autoconf.c: Add support for "interrupt driven configuration hooks". A component of the kernel can register a hook, most likely during auto-configuration, and receive a callback once interrupt services are available. This callback will occur before the root and dump devices are configured, so the configuration task can affect the selection of those two devices or complete any tasks that need to be performed prior to launching init. System boot is posponed so long as a hook is registered. The hook owner is responsible for removing the hook once their task is complete or the system boot can continue. kern_acct.c kern_clock.c kern_exit.c kern_synch.c kern_time.c: Change the interface and implementation for the kernel callout service. The new implemntaion is based on the work of Adam M. Costello and George Varghese, published in a technical report entitled "Redesigning the BSD Callout and Timer Facilities". The interface used in FreeBSD is a little different than the one outlined in the paper. The new function prototypes are: struct callout_handle timeout(void (func)(void ), void arg, int ticks); void untimeout(void (func)(void ), void arg, struct callout_handle handle); If a client wishes to remove a timeout, it must store the callout_handle returned by timeout and pass it to untimeout. The new implementation gives 0(1) insert and removal of callouts making this interface scale well even for applications that keep 100s of callouts outstanding. See the updated timeout.9 man page for more details.
# 29041	02-Sep-1997	bde	Removed unused #includes.
# 28551	21-Aug-1997	bde	#include <machine/limits.h> explicitly in the few places that it is required.
# 28342	17-Aug-1997	julian	Take verbal beating by wollman into account and fix DIAGNOSTIC test. This version. 1/ avoids garret's introduced potential page fault. (I got one) 2/ removes compiler warnings Also fix the tunable scheduling quantum to return a better error code when fed a bad argument.
# 28268	16-Aug-1997	wollman	Dejulianize DIAGNOSTIC panic code. The types are wrong; probably there's a missing dereference.
# 28176	13-Aug-1997	julian	add a diagnostic to catch some common cases of tsleep being called from the wrong place.
# 27989	08-Aug-1997	julian	Make the scheduler quantum a tunable parameter Reviewd by: John Dyson dyson@freebsd.org
# 26812	22-Jun-1997	peter	Preliminary support for per-cpu data pages. This eliminates a lot of #ifdef SMP type code. Things like _curproc reside in a data page that is unique on each cpu, eliminating the expensive macros like: #define curproc (SMPcurproc[cpunumber()]) There are some unresolved bootstrap and address space sharing issues at present, but Steve is waiting on this for other work. There is still some strictly temporary code present that isn't exactly pretty. This is part of a larger change that has run into some bumps, this part is standalone so it should be safe. The temporary code goes away when the full idle cpu support is finished. Reviewed by: fsmp, dyson
# 25164	26-Apr-1997	peter	Man the liferafts! Here comes the long awaited SMP -> -current merge! There are various options documented in i386/conf/LINT, there is more to come over the next few days. The kernel should run pretty much "as before" without the options to activate SMP mode. There are a handful of known "loose ends" that need to be fixed, but have been put off since the SMP kernel is in a moderately good condition at the moment. This commit is the result of the tinkering and testing over the last 14 months by many people. A special thanks to Steve Passe for implementing the APIC code!
# 23163	27-Feb-1997	bde	Wrapped mi_switch() with splstatclock()/splx(). This fixes excessive interrupt latency for certain cases involving for restarting stopped processes.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 22521	10-Feb-1997	dyson	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 18974	17-Oct-1996	dyson	Make processes waken up eligible for immediate swap-in.
# 18459	22-Sep-1996	gpalmer	Remove the code that renices +4 a process that has had 10 minutes of CPU time. I find it slightly annoying on one of our servers here. Also disliked by: David Greenman
# 17989	01-Sep-1996	dg	Change an splstatclock that should be an splhigh into an splhigh. Reviewed by: bde
# 17874	28-Aug-1996	bde	Fixed a stale comment.
# 17369	31-Jul-1996	dg	Changed wakeup_one() to continue looping, possibly waking up additional processes, until it finds one that is not swapped out. Submitted by: dyson
# 17366	31-Jul-1996	dg	Converted timer/run queues to 4.4BSD queue style. Removed old and unused sleep(). Implemented wakeup_one() which may be used in the future to combat the "thundering herd" problem for some special cases. Reviewed by: dyson
# 15105	07-Apr-1996	bde	Don't generate code for the unused function sleep().
# 14525	11-Mar-1996	hsu	Merge in Lite2: proc LIST changes 64-bit fix for alpha add debugging code for locking Reviewed by: david & bde
# 13788	31-Jan-1996	dg	Improved killproc() log message and made it and the other similar message tolerant of p_ucred being invalid. Starting using killproc() where appropriate.
# 13203	03-Jan-1996	wollman	Converted two options over to the new scheme: USER_LDT and KTRACE.
# 12662	07-Dec-1995	dg	Untangled the vm.h include file spaghetti.
# 12577	02-Dec-1995	bde	Completed function declarations and/or added prototypes.
# 12569	02-Dec-1995	bde	Finished (?) cleaning up sysinit stuff.
# 10653	09-Sep-1995	dg	Fixed init functions argument type - caddr_t -> void *. Fixed a couple of compiler warnings.
# 10358	28-Aug-1995	julian	Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert) This is a composite of 3 patch sets submitted by terry. they are: New low-level init code that supports loadbal modules better some cleanups in the namei code to help terry in 16-bit character support some changes to the mount-root code to make it a little more modular.. NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.. certainly mounting root of disk still works just fine.. mfs should work but is untested. (tomorrows task) The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 5079	12-Dec-1994	dg	Don't mess with already freed structures when a process is being run down.
# 3688	18-Oct-1994	dg	Removed references to bclnlist which we don't use/support/need.
# 3308	02-Oct-1994	phk	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 3291	02-Oct-1994	dg	"idle priority" support. Based on code from Henrik Vestergaard Draboel, but substantially rewritten by me.
# 3098	25-Sep-1994	phk	While in the real world, I had a bad case of being swapped out for a lot of cycles. While waiting there I added a lot of the extra ()'s I have, (I have never used LISP to any extent). So I compiled the kernel with -Wall and shut up a lot of "suggest you add ()'s", removed a bunch of unused var's and added a couple of declarations here and there. Having a lap-top is highly recommended. My kernel still runs, yell at me if you kernel breaks.
# 2441	01-Sep-1994	dg	Realtime priority scheduling support. Submitted by: Henrik Vestergaard Draboel
# 1817	02-Aug-1994	dg	Added $Id$
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources