History log of /freebsd-9.3-release/sys/kern/sched_ule.c
Revision Date Author Comments
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 262057 17-Feb-2014 avg

MFC r258622,258675: dtrace sdt: remove the ugly sname parameter of
SDT_PROBE_DEFINE
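For illustration, a hedged sketch of what a probe definition looks like after
this change (macro arity and argument types are examples, not the exact
sched_ule.c lines):

    #include <sys/sdt.h>

    SDT_PROVIDER_DECLARE(sched);
    /* Before r258622 an extra string name ("change-pri") was passed;
     * now the probe name is derived from the C identifier, with "__"
     * translated to "-". */
    SDT_PROBE_DEFINE3(sched, , , change__pri, "struct thread *",
        "struct proc *", "uint8_t");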


# 259986 27-Dec-2013 dim

MFC r259875:

In sys/kern/sched_ule.c, remove static function sched_both(), which is
unused since r232207.


# 259835 24-Dec-2013 jhb

MFC 258869:
Fix an off-by-one error in r228960. The maximum priority delta provided
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.
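A minimal sketch of the corrected clamp, simplified from sched_priority()
(the surrounding computation is abbreviated):

    /* Clamp the ticks-derived delta to SCHED_PRI_RANGE - 1 so that
     * SCHED_PRI_MIN + delta <= SCHED_PRI_MAX (inclusive range). */
    pri = SCHED_PRI_MIN;
    if (ts->ts_ticks)
        pri += MIN(SCHED_PRI_TICKS(ts), SCHED_PRI_RANGE - 1);
    pri += SCHED_PRI_NICE(td->td_proc->p_nice);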


# 256208 09-Oct-2013 mav

MFC r255363:
Micro-optimize cpu_search(), allowing the compiler to use a more efficient
inline ffsl() implementation, when available, instead of homegrown iteration.

On a dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load this reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
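In outline, the iteration change looks like this (a sketch, not the literal
cpu_search() code):

    /* Before: probe every CPU index in the mask. */
    for (cpu = 0; cpu <= mp_maxid; cpu++)
        if (mask & (1UL << cpu))
            ;  /* examine cpu */

    /* After: ffsl() jumps straight to the lowest set bit (1-based). */
    while (mask != 0) {
        cpu = ffsl(mask) - 1;
        mask &= ~(1UL << cpu);
        /* examine cpu */
    }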


# 255569 14-Sep-2013 mav

Temporarily revert r255541 since there is no CPU_FFS in stable/9 yet. Sorry.


# 255541 14-Sep-2013 mav

MFC r255363:
Micro-optimize cpu_search(), allowing the compiler to use a more efficient
inline ffsl() implementation, when available, instead of homegrown iteration.

On a dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load this reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.


# 252070 21-Jun-2013 gnn

MFC: 249514
Point args[0] not at the thread that is ending but at the one that
is starting. This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR: 177706
Submitted by: Tiwei Bie


# 247150 22-Feb-2013 mav

MFC r242852, r243069:
Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there were no context switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If the current CPU was idle, it is quite unlikely that some
other CPU has load to steal. Under a high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on a 24-CPU system the load stealing code may consume
up to 25% of all CPU time without giving any benefit.
- Change the code that implements spinning for load to restart the spin in
case of a context switch. The previous code periodically called cpu_idle()
even under a high interrupt/context switch rate.
- Raise the spinning threshold to 10KHz, where it gives at least some effect
that may be worth the consumed power.
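A rough sketch of the first item, using the switch counters this change
introduced (field names as in that era's sched_ule.c; logic simplified):

    /* sched_idletd(): sum the current and previous idle-period switch
     * counts; if nothing ran here since we last went idle, a wakeup for
     * bus mastering or TLB shutdown should not trigger a steal scan. */
    switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
    if (switchcnt > 1)
        tdq_idled(tdq);    /* try to steal load from other CPUs */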


# 242544 04-Nov-2012 eadler

MFC r241844:
remove duplicate semicolons where possible.

Approved by: cperciva (implicit)


# 241250 06-Oct-2012 mav

MFC r239194:
Allow idle threads to steal second threads from other cores on systems with
8 or more cores to improve utilization. None of my tests on a 2xXeon (2x6x2)
system showed any slowdown from the mentioned "excess thrashing". At the same
time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup
with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing I
tried to limit the stealing to within the same last level cache, but got only
worse results. The present code in any case prefers to steal threads from
topologically closer cores.


# 241249 06-Oct-2012 mav

MFC r239185, r239196:
Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
- restore (4BSD) and implement (ULE) hogticks variable setting;
- make sched_rr_interval() more tolerant of options;
- restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
- tune some sysctl descriptions;
- make some style fixes.


# 241248 06-Oct-2012 mav

MFC r239157:
Rework the r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and the end of the time slice
happened at the same time, the thread was put at the head of the queue as for
preemption alone. That could cause a single thread to run for an indefinitely
long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of
preemption. But that causes a delayed context switch every time preemption
happens, even when not needed.

Solve the problem by introducing a scheduler-specific thread flag,
TDF_SLICEEND, set when the thread's time slice is over and it should be put
at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as was
done before, was simply not informative enough to work correctly.

In my tests this reduces run time deviation by 2-3 times (improving fairness)
in cases when several threads share one CPU.
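In outline, the flag separates "slice expired" from "preempted" (a sketch
close to, but not verbatim, the committed code):

    /* sched_clock(): the slice ran out - reschedule and record why. */
    if (--ts->ts_slice <= 0) {
        ts->ts_slice = sched_slice;
        td->td_flags |= TDF_NEEDRESCHED | TDF_SLICEEND;
    }

    /* sched_switch(): only a pure preemption earns head-of-queue
     * placement; an expired slice goes back to the tail. */
    preempted = (flags & SW_PREEMPT) &&
        (td->td_flags & TDF_SLICEEND) == 0;
    srqflag = preempted ?
        SRQ_OURSELF | SRQ_YIELDING | SRQ_PREEMPTED :
        SRQ_OURSELF | SRQ_YIELDING;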


# 240839 22-Sep-2012 avg

MFC r240513: sched_ule: fix inverted condition in reporting of priority
lending via ktr


# 236547 04-Jun-2012 mav

MFC r234066:
Microoptimize cpu_search().

According to profiling, it makes cpu_search() take 6% of CPU time on
hackbench, with its millions of context switches per second, instead of 8%
before.


# 236546 04-Jun-2012 mav

MFC r232740:
Make the kern.sched.idlespinthresh default value adaptive, depending on HZ.
Otherwise, with HZ above 8000, the CPU may never skip timer ticks when idle.


# 236344 30-May-2012 rstone

MFC r235459 and r235471

r235459:
Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
several probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added two probes that are not defined in Solaris, lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated

r235471:
Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.


# 234166 12-Apr-2012 mav

MFC r232917:
Rewrite thread CPU usage percentage math so it does not depend on periodic
calls with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but for now it just removes
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS, which exists in patches out of the tree, depends on it.


# 233814 02-Apr-2012 jhb

MFC 232700:
Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().


# 233599 28-Mar-2012 mav

MFC r232207, r232454:
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking the previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle, to avoid
extra resource sharing.
- In sched_pickcpu() change the general logic of CPU selection. First
look for an idle CPU sharing the last level cache with the previously used
one, skipping SMT CPU groups. If none is found, search all CPUs for the
least loaded one where the thread with its priority can run now. If none is
found, search just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows differentiating 1+1 and 2+0 loads.
- Make cpu_search() prefer the specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection when the above factors are equal. The previous
code tended to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework the periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make the balancing process flat, removing recursion
over the topology tree. That fixes the double swap problem and makes load
distribution more even and predictable.

All together this gives a 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, when the number of threads is less than the
number of logical CPUs. In some tests it also gives a positive effect to
systems without SMT.

Sponsored by: iXsystems, Inc.


# 230691 28-Jan-2012 marius

MFC: r226057

- Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to
itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 (MFC'ed to stable/9 in r230690) and the the above change in
place, sparc64 now no longer is incompatible with ULE and vice versa.
However, powerpc/E500 still is.

Submitted by: jhb
Reviewed by: jeff
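A minimal sketch of the resulting check (simplified; the real diff sits in
the balancer code):

    /* Moving load to ourselves needs no IPI: once this routine
     * returns, the current CPU will reschedule on its own. */
    if (cpu != PCPU_GET(cpuid))
        ipi_cpu(cpu, IPI_PREEMPT);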


# 230173 15-Jan-2012 avg

MFC r228718: ule: ensure that batch timeshare threads are scheduled
fairly


# 230083 13-Jan-2012 jhb

MFC 228960:
Cap the priority calculated from the current thread's running tick count
at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
happen due to irregularities with clock interrupts under certain
virtualization environments.


# 230079 13-Jan-2012 jhb

MFC 229429:
Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicking mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 225199 26-Aug-2011 delphij

Fix format strings for KTR_STATE in the 4BSD and ULE schedulers.

Submitted by: Ivan Klymenko <fidaj@ukr.net>
PR: kern/159904, kern/159905
MFC after: 2 weeks
Approved by: re (kib)


# 224221 19-Jul-2011 attilio

Remove explicit MAXCPU usage from sys/pcpu.h, avoiding namespace
pollution. That is a further step in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.

Reported by: jhb
Reviewed and tested by: pluknet
Approved by: re (kib)


# 222813 07-Jun-2011 attilio

Retire the cpumask_t type and replace it with cpuset_t usage.

This is intended to fix the bug where cpu mask objects are
capped at 32 bits. MAXCPU can now be arbitrarily bumped to whatever
value. Anyway, as long as several structures in the kernel are
statically allocated and sized as MAXCPU, it is suggested to keep it
as low as possible for the time being.

Technical notes on this commit itself:
- More functions to handle cpuset_t objects are introduced.
The most notable are cpusetobj_ffs() (which calculates a ffs(3)
for a cpuset_t object), cpusetobj_strprint() (which prepares a string
representing a cpuset_t object) and cpusetobj_strscan() (which
creates a valid cpuset_t starting from a string representation).
- pc_cpumask and pc_other_cpus are targeted for removal soon.
With the move from cpumask_t to cpuset_t they are now inefficient
and not really useful. Anyway, for the time being, please note that
access to pcpu data is protected by sched_pin() in order to avoid
migrating the CPU while reading more than one (possible) word.
- Please note that the size of cpuset_t objects may differ between kernel
and userland. While this is not directly related to the patch itself,
it is good to understand that concept and possibly use the patch
as a reference on how to deal with cpuset_t objects in userland, when
accessing kernel members.
- KTR_CPUMASK is changed and is now represented through a string, to be
set as in the example reported in NOTES.

Please additionally note that no MAXCPU is bumped in this patch, but
private testing has been done up to MAXCPU=128 on a real 8x8x2(htt)
machine (amd64).

Please note that the FreeBSD version is not yet bumped because of
the upcoming pcpu changes. However, note that this patch is not
targeted for MFC.

People to thank for the time spent on this patch:
- sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested
several revisions of the patches and really helped in improving the
stability of this work.
- marius fixed several bugs in the sparc64 implementation and reviewed
patches related to ktr.
- jeff and jhb discussed the basic approach followed.
- kib and marcel made targeted review on some specific part of the
patch.
- marius, art, nwhitehorn and andreast reviewed MD specific part of
the patch.
- marius, andreast, gonzo, nwhitehorn and jceel tested MD specific
implementations of the patch.
- Other people have made contributions on other patches that have been
already committed and have been listed separately.

Companies that should be mentioned for having participated to various
degrees:
- Yahoo! for having offered the machines used for testing on big
count of CPUs.
- The FreeBSD Foundation for having sponsored my devsummit attendance,
which has been instrumental.
- Sandvine for having offered offices and infrastructure during
development.

(I really hope I didn't forget anyone; if I did, I apologize in
advance.)


# 220198 31-Mar-2011 fabient

Clearing the flag when preempting will let the preempted thread run for
too much time. This can end in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP, or no SMP.

As requested by jhb@, 4BSD has received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by: jhb, jeff
MFC after: 1 month


# 217410 14-Jan-2011 jhb

Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
the timesharing priority band can be increased. The new timeshare range
is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
interactive threads. Instead, the larger timeshare range is now split
into separate subranges for interactive and non-interactive ("batch")
threads. The end result is that interactive threads and non-interactive
threads still use the same priority ranges as before, but realtime
threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
or via cv_broadcastpri(). Realtime and idle priority threads will
no longer have their priorities affected by sleeping in the kernel.

Reviewed by: jeff


# 217351 13-Jan-2011 jhb

Introduce two new helper macros to define the priority ranges used for
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE. No functional change.

Reviewed by: jeff


# 217291 11-Jan-2011 jhb

Always use PRI_BASE() when checking the base type of a thread's priority
class.

MFC after: 2 weeks


# 217237 10-Jan-2011 jhb

Fix two harmless off-by-one errors.

Reviewed by: jeff
MFC after: 2 weeks


# 217078 06-Jan-2011 jhb

- Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy of the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.

MFC after: 2 weeks


# 216791 29-Dec-2010 davidxu

- Following r216313, sched_unlend_user_prio is no longer needed; always
use sched_lend_user_prio to set the lent priority.
- Improve the pthread priority-inherit mutex: when a contender's priority
is lowered, re-propagate priorities; this may cause the mutex owner's
priority to be lowered too. In the old code, the mutex owner's priority
was raise-only.


# 216313 09-Dec-2010 davidxu

MFp4:
It is possible for a lower priority thread to lend priority to a higher
priority thread; in the old code this was ignored. However, the lending
should always be recorded. Add the field td_lend_user_pri to fix the
problem; if a thread does not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week


# 215240 13-Nov-2010 trasz

Remove unused variables.


# 215102 10-Nov-2010 attilio

Fix typos.

Submitted by: gianni
MFC after: 3 days


# 214611 31-Oct-2010 davidxu

Use an integer for the size of a cpuset, as it won't be bigger than
INT_MAX. This is requested by bge.
Also move the sysctl into kern_cpuset.c, because it should always be
there; it is independent of the thread scheduler.


# 214510 29-Oct-2010 davidxu

Add sysctl kern.sched.cpusetsize to export the size of the kernel cpuset;
also add the sysconf() key _SC_CPUSET_SIZE to get the sysctl value.

Submitted by: gcooper


# 212974 21-Sep-2010 jhb

Comment nit: set TDF_NEEDRESCHED after the comment describing why it is
done, rather than before.

MFC after: 1 week


# 212821 18-Sep-2010 avg

kern.sched.topology_spec sysctl: use a step of 1 for group level numbering.

This is just a cosmetic change for prettier output. The 'indent'
variable/parameter serves two purposes: it specifies the whitespace
indentation level and also implies the cpu group level/depth. It would
have been better to split those two uses, but for now it is just a
simple change.

MFC after: 1 week


# 212541 13-Sep-2010 mav

Refactor timer management code with priority given to one-shot operation
mode. The main goal of this is to generate timer interrupts only when there
is some work to do. When the CPU is busy, interrupts are generated at the
full rate of hz + stathz to fulfill scheduler and timekeeping requirements.
But when the CPU is idle, only the minimum set of interrupts (down to 8
interrupts per second per CPU now) needed to handle scheduled callouts is
executed. This allows a significant increase in idle CPU sleep time,
increasing the effect of static power-saving technologies. It should also
reduce host CPU load on virtualized systems when the guest system is idle.

There is a set of tunables, also available as writable sysctls, controlling
the desired event timer subsystem behavior:
kern.eventtimer.timer - chooses the event timer hardware to use. On x86
there are up to 4 different kinds of timers. Depending on whether the
chosen timer is per-CPU, the behavior of the other options differs slightly.
kern.eventtimer.periodic - chooses between periodic and one-shot operation
mode. In periodic mode, the current timer hardware is taken as the only
source of time for time events. This mode is quite similar to the previous
kernel behavior. One-shot mode instead uses the currently selected time
counter hardware to schedule all needed events one by one and programs the
timer to generate an interrupt at exactly the specified time. The default
value depends on the chosen timer's capabilities, but one-shot mode is
preferred unless another is forced by the user or hardware.
kern.eventtimer.singlemul - in periodic mode, specifies how many times
higher the timer frequency should be, so as not to strictly alias
hardclock() and statclock() events. Default values are 2 and 4, but they
could be reduced to 1 if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU receive every timer interrupt
independently of whether it is busy or not. By default this option is
disabled. If the chosen timer is per-CPU and runs in periodic mode, this
option has no effect - all interrupts are generated anyway.

Since this patch modifies cpu_idle() on some platforms, I have also
refactored it on x86. Now it makes use of the MONITOR/MWAIT instructions
(if supported) under high sleep/wakeup rates, as a fast alternative to
other methods. It allows the SMP scheduler to wake up sleeping CPUs much
faster without using IPIs, significantly increasing performance on some
highly task-switching loads.

Tested by: many (on i386, amd64, sparc64 and powerpc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.


# 212416 10-Sep-2010 mav

Do not IPI a CPU that is already spinning for load. It doubles the effect
of spinning (compared to MWAIT) on some heavily switching test loads.
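Sketch of the idea in tdq_notify() (tdq_cpu_idle stands for a flag set only
once the CPU has truly halted; treat the naming as illustrative):

    /* A CPU still spinning in its idle loop will notice the new load
     * by itself; only a halted (HLT/MWAIT) CPU needs the interrupt. */
    if (tdq->tdq_cpu_idle)
        ipi_cpu(cpu, IPI_PREEMPT);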


# 212153 02-Sep-2010 mdf

Fix UP build.

MFC after: 2 weeks


# 212115 01-Sep-2010 mdf

Fix a bug with sched_affinity() where it checks td_pinned of another
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU. Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.

KASSERT in sched_bind() that the thread is not yet pinned to a CPU.

KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.

sched_affinity code came from jhb@.

MFC after: 2 weeks


# 211515 19-Aug-2010 jhb

Remove unused KTRACE includes.


# 210939 06-Aug-2010 jhb

Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
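The conversion is mechanical; for example:

    /* Before: build a one-CPU mask for the mask-based API. */
    ipi_selected(1 << cpu, IPI_PREEMPT);

    /* After: address the CPU directly by its cpuid. */
    ipi_cpu(cpu, IPI_PREEMPT);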


# 210117 15-Jul-2010 ivoras

A cosmetic change - don't output empty <flags>.


# 209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().
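Typical shape of the conversion (CPU_FOREACH skips absent CPU ids itself):

    /* Before: */
    for (i = 0; i <= mp_maxid; i++) {
        if (CPU_ABSENT(i))
            continue;
        /* ... per-CPU work ... */
    }

    /* After: */
    CPU_FOREACH(i) {
        /* ... per-CPU work ... */
    }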


# 208983 10-Jun-2010 ivoras

Unconfuse THREAD and SMT flags


# 208982 10-Jun-2010 ivoras

Cosmetic change to XML - less ugly newlines


# 208787 03-Jun-2010 jhb

Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it. All of the current callers already hold the
lock.

MFC after: 1 month


# 208391 21-May-2010 jhb

Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after: 1 month


# 208165 16-May-2010 rrs

This pushes all of JC's patches that I have in place. I
am now able to run 32 cores ok, but I will still hang
on buildworld with an NFS problem. I suspect I am missing
a patch for the netlogic rge driver.

JC, check and see if I am missing anything except your
core-mask changes.

Obtained from: JC


# 202889 23-Jan-2010 attilio

- Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
sched_lock was acquired (without the aid of the td_lock interface) and
the td_lock was dropped. This was going to break locking rules for other
threads wanting to access the thread (via the td_lock interface) and
modify its flags (allowed as long as the container lock was different
from the one used in sched_switch). In order to prevent this situation,
while sched_lock is acquired there the td_lock gets blocked. [0]
- Merge ULE's internal function thread_block_switch() into the global
thread_lock_block() and make the former's semantics the default for
thread_lock_block(). This means that thread_lock_block() will not
disable interrupts when called (and consequently thread_unlock_block()
will not re-enable them when called). This should be done manually
when necessary.
Note, however, that ULE's thread_unblock_switch() is not reaped,
because it does reflect a difference in semantics due to ULE (the
td_lock may not necessarily still be blocked_lock when calling it).
While asymmetric, it does describe a remarkable difference in semantics
that is good to keep in mind.

[0] Reported by: Kohji Okuno
<okuno dot kohji at jp dot panasonic dot com>
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC: 2 weeks


# 201347 31-Dec-2009 kib

Allow swap out of the kernel stack for a thread with priority greater
than or equal to PSOCK, not less than or equal. A higher priority has a
lesser numerical value.

The existing test does not allow swapout of a thread waiting for an
advisory lock, for an exiting child, or sleeping on a timeout. On the
other hand, high-priority waiters on VFS/VM events can be swapped out.

Tested by: pho
Reviewed by: jhb
MFC after: 1 week


# 201148 28-Dec-2009 ed

Don't forget to use `void' for sched_balance(). It has no arguments.


# 199764 24-Nov-2009 ivoras

Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by: jeff (who thinks there should be a better way in the future)
Approved by: gnn (mentor)
MFC after: 3 weeks


# 198854 03-Nov-2009 attilio

Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvement aims to avoid further cache misses in scheduler-specific
functions which need to keep track of average thread running time, and
further locking in places setting this flag.

Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>


# 198126 15-Oct-2009 jhb

Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by: Taku YAMAMOTO taku of tackymt.homeip.net
Reviewed by: jeff
MFC after: 3 days


# 197223 15-Sep-2009 attilio

Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are no longer shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypothesis: a thread needs to migrate to another CPU; thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request, or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' refers to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' refers to the
context thread1 is going into. Finally, thread3 is doing tdq_idled()
or sched_balance() and definitively doing tdq_lock_pair())

* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 receives the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2, so it spins
* thread1 wants to acquire the old lock, but as long as it is held by
thread3 it can't
* thread2, going further, at some point wants to switch in thread1,
but it will wait forever because thread1->td_lock is in the blocked state

This deadlock has manifested mostly on 7.x and has been reported several
times on mailing lists under the title 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and to Jeff for help on the comment wording.

Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)


# 194779 23-Jun-2009 jeff

- Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
cpumask_t to the number of bits in a long.
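Illustrative before/after of the mask handling (a sketch):

    /* Before: cpumask_t is a single long - at most 8*sizeof(long) CPUs. */
    cpumask_t mask = 0;
    mask |= 1 << cpu;
    if (mask & (1 << cpu))
        ;  /* cpu is in the set */

    /* After: cpuset_t is a fixed-size bit set dimensioned by MAXCPU. */
    cpuset_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (CPU_ISSET(cpu, &cset))
        ;  /* cpu is in the set */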


# 191676 29-Apr-2009 jeff

- Fix non-SMP build by encapsulating idle spin logic in a macro.

Pointy hat to: me


# 191645 29-Apr-2009 jeff

- Fix the FBSDID line.


# 191643 29-Apr-2009 jeff

- Remove the bogus idle thread state code. This may have a race in it
and it only optimized out an ipi or mwait in very few cases.
- Skip the adaptive idle code when running on SMT or HTT cores. This
just wastes cpu time that could be used on a busy thread on the same
core.
- Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
CG_FLAG_THREAD to mean SMT or HTT.

Sponsored by: Nokia


# 189787 14-Mar-2009 jeff

- Fix an error that occurs when mp_ncpus is an odd number. steal_thresh
is calculated as 0, which causes errors elsewhere.

Submitted by: KOIE Hidetaka <koie@suri.co.jp>

- When sched_affinity() is called with a thread that is not curthread we
need to handle the ON_RUNQ() case by adding the thread to the correct
run queue.

Submitted by: Justin Teller <justin.teller@gmail.com>

MFC after: 1 Week


# 187679 25-Jan-2009 jeff

- Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
something more reasonable such as sizeof("32"). This shouldn't have
caused any ill effect until we run on machines with 1000000 or more
cpus.
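The distinction in a nutshell (standard two-level stringification, as in
sys/cdefs.h; sizes shown assume MAXCPU expands to 32):

    #define __STRING(x)  #x           /* stringify without expansion */
    #define __XSTRING(x) __STRING(x)  /* expand x first, then stringify */

    /* With #define MAXCPU 32:
     *   __STRING(MAXCPU)  -> "MAXCPU"  sizeof == 7
     *   __XSTRING(MAXCPU) -> "32"      sizeof == 3
     */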


# 187357 17-Jan-2009 jeff

- Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py. This allows developers to quickly
create a graphical view of ktr data for any resource in the system.
- Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
identifying records associated with a thread or cpu.
- Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from: attilio
Discussed with: jhb
Sponsored by: Nokia


# 186435 23-Dec-2008 ivoras

Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by: jeff (original version)
Approved by: gnn (mentor) (original version)


# 185047 18-Nov-2008 jhb

When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by: Ravi Murty @ Intel
Reviewed by: jeff


# 184570 02-Nov-2008 ivoras

Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by: gnn (mentor) (older version)
Pointed out by: rdivacky


# 184439 29-Oct-2008 ivoras

Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <flags></flags>
  <children>
   <group level="2" cache-level="0">
    <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
    <flags></flags>
   </group>
   <group level="2" cache-level="0">
    <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
    <flags></flags>
   </group>
  </children>
 </group>
</groups>

Reviewed by: jeff
Approved by: gnn (mentor)


# 180607 19-Jul-2008 jeff

- Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick. This pushes
the value out of range and breaks priority calculation.

Reviewed by: kib
Found by: pho/nokia
Sponsored by: Nokia
MFC after: 3 days


# 179297 24-May-2008 jb

Add the vtime (virtual time) hooks for DTrace.


# 178471 25-Apr-2008 jeff

- Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle(), to wake up idle threads that are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.

Sponsored by: Nokia


# 178277 17-Apr-2008 jeff

- Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
sched_clock() is called.
- If the busy metric exceeds a threshold allow the idle thread to spin
waiting for new work for a brief period to avoid using IPIs. This
reduces the cost on the sender and receiver as well as reducing wakeup
latency considerably when it works.

Sponsored by: Nokia


# 178272 17-Apr-2008 jeff

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia
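Callers then pass the reason along with the switch flags; for instance, a
voluntary yield might look like this (illustrative use of the SW_/SWT_
constants):

    mi_switch(SW_VOL | SWT_RELINQUISH, NULL);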


# 178215 15-Apr-2008 marcel

Support and switch to the ULE scheduler:
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI,


# 177903 03-Apr-2008 jeff

- Allow static_boost to specify no boost with '0', traditional kernel
fixed pri boost with '1' or any priority less than the current thread's
priority with a value greater than two. Default the boost to
PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
threads in the kernel. This prevents these user-threads from also
being scheduled as if they are high fixed-priority kernel threads.
- Restore the setting of lowpri in tdq_choose(). It has to be either here
or in sched_switch(). I accidentally removed it from both places.

Tested by: kris


# 177902 03-Apr-2008 jeff

- Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either. Simply check P_NOLOAD. It'd be nice if this was
in a thread flag so we didn't have an extra cache miss every time we
add and remove a thread from the run-queue.


# 177435 20-Mar-2008 jeff

- Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h


# 177426 20-Mar-2008 jeff

- ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread(). The td_sched pointer is already
setup by thread_init anyway.


# 177376 19-Mar-2008 jeff

- Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
- Add some comments to sched_thread_priority().


# 177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177169 14-Mar-2008 jhb

Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 177085 12-Mar-2008 jeff

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# 177042 10-Mar-2008 jeff

- Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
when we adjusted the priority. This involves the same number of
branches as before so should perform identically without the extra
fragility.

Tested by: bz
Reviewed by: bz


# 177011 10-Mar-2008 jeff

- Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0. Set it directly in sched_setup(). This fixes traps on boot
seen on some machines.

Reported by: phk


# 177009 10-Mar-2008 jeff

Reduce ULE context switch time by over 25%.

- Only calculate timeshare priorities once per tick or when a thread is woken
from sleeping.
- Keep the ts_runq pointer valid after all priority changes.
- Call tdq_runq_add() directly from sched_switch() without passing in via
tdq_add(). We don't need to adjust loads or runqs anymore.
- Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by: Nokia


# 177005 09-Mar-2008 jeff

- Add an implementation of sched_preempt() that avoids excessive IPIs.
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
so the logic is identical and not repeated between tdq_notify() and
sched_setpreempt().
- In tdq_notify() don't set NEEDRESCHED, as we may not actually own the thread
lock; this could have caused us to lose td_flags settings.
- Garbage collect some tunables that are no longer relevant.
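The shared decision logic, roughly (a condensed sketch of
sched_shouldpreempt(); the committed version has a few more cases):

    static inline int
    sched_shouldpreempt(int pri, int cpri, int remote)
    {
        if (pri >= cpri)             /* not better: never preempt */
            return (0);
        if (cpri >= PRI_MIN_IDLE)    /* always preempt the idle thread */
            return (1);
        if (preempt_thresh == 0)     /* preemption disabled */
            return (0);
        if (pri <= preempt_thresh)   /* important enough to preempt */
            return (1);
        return (0);
    }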


# 176735 02-Mar-2008 jeff

Add support for the new cpu topology api:
- When searching for affinity search backwards in the tree from the last
cpu we ran on while the thread still has affinity for the group. This
can take advantage of knowledge of shared L2 or L3 caches among a
group of cores.
- When searching for the least loaded cpu find the least loaded cpu via
the least loaded path through the tree. This load balances system bus
links, individual cache levels, and hyper-threaded/SMT cores.
- Make the periodic balancer recursively balance the highest and lowest
loaded cpu across each link.

Add support for cpusets:
- Convert the cpuset to a simple native cpumask_t while the kernel still
only supports cpumask.
- Pass the derived cpumask down through the cpu_search functions to
restrict the result cpus.
- Make the various steal functions resilient to failure since all threads
can not run on all cpus any longer.

General improvements:
- Precisely track the lowest priority thread on every runq with
tdq_setlowpri(). Before it was more advisory but this ended up having
pathological behaviors.
- Remove many #ifdef SMP conditions to simplify the code.
- Get rid of the old cumbersome tdq_group. This is more naturally
expressed via the cpu_group tree.

Sponsored by: Nokia
Testing by: kris


# 176734 02-Mar-2008 jeff

- Remove the old smp cpu topology specification with a new, more flexible
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a separate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.

Sponsored by: Nokia


# 176729 02-Mar-2008 jeff

- Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
- Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by: Nokia


# 175587 23-Jan-2008 jeff

- sched_prio() should only adjust tdq_lowpri if the thread is running or on
a run-queue. If the priority is numerically raised only change lowpri
if we're certain it will be correct. Some slop is allowed; however,
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on, which led to errors in load balancing
decisions.


# 175348 15-Jan-2008 jeff

- When executing the 'tryself' branch in sched_pickcpu() look at the
lowest priority on the queue for the current cpu vs curthread's
priority. In the case that curthread is waking up many threads of a
lower priority as would happen with a turnstile_broadcast() or wakeup()
of many threads this prevents them from all ending up on the current cpu.
- In sched_add() make the relationship between a scheduled ithread and
the current cpu advisory rather than strict. Only give the ithread
affinity for the current cpu if it's actually being scheduled from
a hardware interrupt. This prevents it from migrating when it simply
blocks on a lock.

Sponsored by: Nokia


# 175104 05-Jan-2008 jeff

- Restore timeslicing code for all but SCHED_FIFO priority classes.

Reported by: Peter Jeremy <peterjeremy@optushome.com.au>


# 174847 21-Dec-2007 wkoszek

Make SCHED_ULE buildable with gcc3.

Reviewed by: cognet (mentor), jeffr
Approved by: cognet (mentor), jeffr


# 174629 15-Dec-2007 jeff

- Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a separate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.

In collaboration with: Kip Macy
Sponsored by: Nokia


# 174536 11-Dec-2007 davidxu

Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days


# 173600 14-Nov-2007 julian

Generally we are interested in what thread did something, as
opposed to what process. Since threads by default have the name of the
process unless overwritten with more useful information, just print the
thread name instead.


# 172887 22-Oct-2007 grehan

Cut over to ULE on PowerPC

kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
updating the old thread's lock. Note: uniprocessor-only, will require
modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by: marcel, jeffr (slightly older version)


# 172709 16-Oct-2007 sam

ULE works fine on arm; allow it to be used

Reviewed by: jeff, cognet, imp
MFC after: 1 week


# 172484 08-Oct-2007 jeff

- Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
constantly switches after trying to steal work from the local cpu.
- Make the idle stealing code more robust against self selection.
- Prefer to steal from the cpu with the highest load that has at least one
transferable thread. Before we selected the cpu with the highest
transferable count which excludes bound threads.

Collaborated with: csjp
Approved by: re


# 172483 08-Oct-2007 jeff

- Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching. This allows
threads of equal priority to run but not lesser priority.

Discussed with: davidxu
Reported by: NIIMI Satoshi <sa2c@sa2c.net>
Approved by: re


# 172411 01-Oct-2007 jeff

- Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
the local queue lock while another cpu may switch it in. This race
is only possible on machines where cpu_switch can take significantly
longer on different cpus which in practice means HTT machines with
unfair thread scheduling algorithms.

Found by: kris (of course)
Approved by: re


# 172409 01-Oct-2007 jeff

- Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
- Change the rebalancer to work on stathz ticks but retain randomization.
- Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
complex sequences of locks to avoid deadlock.

Reported by: kris
Approved by: re


# 172345 27-Sep-2007 jeff

- Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately. It can still be
adjusted at runtime. ULE will still use IPI_PREEMPT in certain
migration situations.
- Assert that we're not trying to compile ULE on an unsupported
architecture. To date, I believe only i386 and amd64 have implemented
the third cpu switch argument required.

Approved by: re


# 172308 23-Sep-2007 jeff

- Bound the interactivity score so that it cannot become negative.

Approved by: re


# 172293 22-Sep-2007 jeff

- Improve grammar. s/it's/its/.
- Improve the long-term load balancer by always IPIing exactly once.
Previously the delay after rebalancing could cause problems with
uneven workloads.
- Allow nice to have a linear effect on the interactivity score. This
allows negatively niced programs to stay interactive longer. It may be
useful with very expensive Xorg servers under high loads. In general
it should not be necessary to alter the nice level to improve interactive
response. We may also want to consider never allowing positively niced
processes to become interactive at all.
- Initialize ccpu to 0 rather than 0.0. The decimal point was leftover
from when the code was copied from 4bsd. ccpu is 0 in ULE because ULE
only exports weighted cpu values.

Reported by: Steve Kargl (Load balancing problem)
Approved by: re


# 172264 21-Sep-2007 jeff

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re
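For example, a seconds-based computation turns into tick subtraction
(sketch):

    /* Old: td_slptime counted whole seconds directly.         */
    /* New: stamp the value of 'ticks' when the thread sleeps  */
    td->td_slptick = ticks;
    /* ... and derive seconds on demand: */
    slept_sec = (ticks - td->td_slptick) / hz;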


# 172207 17-Sep-2007 jeff

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 171899 20-Aug-2007 jeff

- Set steal_thresh to log2(ncpus). This improves idle-time load balancing
on 2cpu machines by reducing it to 1 by default. This improves loaded
operation on 8cpu machines by increasing it to 3 where the extra idle
time is not as critical.

Approved by: re


# 171715 03-Aug-2007 jeff

- Fix one line that erroneously crept in my last commit.

Approved by: re


# 171713 03-Aug-2007 jeff

- Share scheduler locks between hyper-threaded cores to protect the
tdq_group structure. Hyper-threaded cores won't really benefit from
separate locks anyway.
- Separate out the migration case from sched_switch to simplify the main
switch code. We only migrate here if called via sched_bind().
- When preempted place the preempted thread back in the same queue at
the head.
- Improve the cpu group and topology infrastructure.

Tested by: many on current@
Approved by: re


# 171506 19-Jul-2007 jeff

- Refine the load balancer to improve buildkernel times on dual core
machines.
- Leave the long-term load balancer running by default once per second.
- Enable stealing load from the idle thread only when the remote processor
has more than two transferable tasks. Setting this to one further
improves buildworld. Setting it higher improves mysql.
- Remove the bogus pick_zero option. I had not intended to commit this.
- Entirely disallow migration for threads with SRQ_YIELDING set. This
balances out the extra migration allowed for with the load balancers.
It also makes pick_pri perform better as I had anticipated.

Tested by: Dmitry Morozovsky <marck@rinet.ru>
Approved by: re


# 171505 19-Jul-2007 jeff

- When newtd is specified to sched_switch() it was not being initialized
properly. We have to temporarily unlock the TDQ lock so we can lock
the thread and add it to the run queue. This is used only for KSE.
- When we add a thread from the tdq_move() via sched_balance() we need to
ipi the target if it's sitting in the idle thread or it'll never run.

Reported by: Rene Landan
Approved by: re


# 171482 17-Jul-2007 jeff

ULE 3.0: Fine grain scheduler locking and affinity improvements. This has
been in development for over 6 months as SCHED_SMP.
- Implement one spin lock per thread-queue. Threads assigned to a
run-queue point to this lock via td_lock.
- Improve the facility for assigning threads to CPUs now that sched_lock
contention no longer dominates scheduling decisions on larger SMP
machines.
- Re-write idle time stealing in an attempt to make it less damaging to
general performance. This is still disabled by default. See
kern.sched.steal_idle.
- Call the long-term load balancer from a callout rather than sched_clock()
so there are no locks held. This is disabled by default. See
kern.sched.balance.
- Parameterize many scheduling decisions via sysctls. Try to document
these via sysctl descriptions.
- General structural and naming cleanups.
- Document each function with comments.

Tested by: current@ amd64, x86, UP, SMP.
Approved by: re


# 170787 15-Jun-2007 jeff

- Fix an off by one error in sched_pri_range.
- In tdq_choose() only assert that a thread does not have too high a
priority (low value) for the queue we removed it from. This will catch
bugs in priority elevation. It's not a serious error for the thread
to have too low a priority as we don't change queues in this case as
an optimization.

Reported by: kris


# 170600 12-Jun-2007 jeff

- Move some common code out of sched_fork_exit() and back into fork_exit().


# 170358 06-Jun-2007 jeff

- Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by: tegge


# 170317 05-Jun-2007 jeff

- Better fix for the previous error; use __DEVOLATILE() on the td_lock
pointer, as it can actually sometimes be something other than sched_lock,
even on schedulers which rely on a global scheduler lock.

Tested by: kan


# 170316 05-Jun-2007 jeff

- Pass &sched_lock as the third argument to cpu_switch() as this will
always be the correct lock and we don't get volatile warnings this
way.

Pointed out by: kan


# 170315 05-Jun-2007 jeff

- Define TDQ_ID() for the !SMP case.
- Default pick_pri to off. It is not faster in most cases.


# 170293 04-Jun-2007 jeff

Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
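The core trick, roughly (a sketch of the retry loop behind thread_lock();
the committed code adds spin accounting and critical sections):

    /* td_lock is a pointer that another CPU may retarget while we
     * spin; keep locking whatever it currently points at until the
     * pointer is stable under the lock we hold. */
    for (;;) {
        struct mtx *lock = td->td_lock;
        mtx_lock_spin(lock);
        if (lock == td->td_lock)
            break;              /* won the race: lock is stable */
        mtx_unlock_spin(lock);  /* lock moved under us; retry */
    }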


# 168891 20-Apr-2007 kmacy

Schedule the ithread on the same cpu as the interrupt

Tested by: kmacy
Submitted by: jeffr


# 167669 17-Mar-2007 jeff

- Handle the case where slptime == runtime.

Submitted by: Antoine Brodin


# 167664 17-Mar-2007 jeff

- Cast the intermediate value in the priority computation back down to
unsigned char. Weirdly, casting the 1 constant to u_char still produces
a signed integer result that is then used in the % computation. This
avoids that mess altogether and causes a 0 pri to turn into 255 % 64,
as we expect.

Reported by: kkenn (about 4 times, thanks)


# 167327 08-Mar-2007 julian

Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permanent and never changes for the life of the
system so it doesn't need locking.


# 167012 26-Feb-2007 kmacy

general LOCK_PROFILING cleanup

- only collect timestamps when a lock is contested - this reduces the overhead
of collecting profiles from 20x to 5x

- remove unused function from subr_lock.c

- generalize cnt_hold and cnt_lock statistics to be kept for all locks

- NOTE: rwlock profiling generates invalid statistics (and most likely always has)
someone familiar with that should review


# 166557 07-Feb-2007 jeff

- Change types for recent runq additions to u_char rather than int.
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by: kkenn using pho's stress2.


# 166247 25-Jan-2007 jeff

- Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads. An IPI is sent when any of the
following holds:
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
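
A hedged standalone sketch of that three-way test; the function name, the
threshold variable, and the numeric priority bounds are illustrative
assumptions, not the committed code (lower numeric value means better
priority):

    #define PRI_MAX_REALTIME    79      /* assumed realtime/timeshare line */
    #define PRI_MIN_IDLE        224     /* assumed start of idle class */

    static int preempt_thresh = 64;     /* assumed tunable threshold */

    static int
    should_ipi(int pri, int cpri)       /* new vs. currently running */
    {
        if (cpri >= PRI_MIN_IDLE)       /* 1) an idle thread is running */
            return (1);
        if (pri <= PRI_MAX_REALTIME && cpri > PRI_MAX_REALTIME)
            return (1);                 /* 2) crossing the realtime line */
        if (pri < preempt_thresh)       /* 3) under the threshold */
            return (1);
        return (0);
    }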


# 166229 25-Jan-2007 jeff

- Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
- Rename the KTR level for non schedgraph parsed events. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted its slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.


# 166208 24-Jan-2007 jeff

- With a sleep time over 2097 seconds hzticks and slptime could end up
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.
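
The arithmetic behind that threshold, assuming hz = 1000 and a 10-bit tick
scaling factor (both assumptions for illustration): 2^31 / (1000 * 1024) is
roughly 2097 seconds, so one more second of scaled sleep time overflows a
signed 32-bit counter. A standalone sketch:

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* 2098 s of sleep at hz = 1000, scaled by << 10 (assumed). */
        long long scaled = 2098LL * 1000 << 10;     /* 2,148,352,000 */

        printf("scaled: %lld, INT_MAX: %d\n", scaled, INT_MAX);
        /* Wraps negative in a 32-bit signed counter (two's complement). */
        printf("signed:   %d\n", (int)scaled);
        /* Stays ordered in a 32-bit unsigned counter. */
        printf("unsigned: %u\n", (unsigned)scaled);
        return (0);
    }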


# 166190 23-Jan-2007 jeff

- Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt(). We want to be able
to preempt idlethread in all cases.
- Define our idlethread to require preemption to exit.
- Get the cpu estimation tick from sched_tick() so we don't have to worry
about errors from a sampling interval that differs from the time
domain. This was the source of sched_priority prints/panics and
inaccurate pctcpu display in top.


# 166156 20-Jan-2007 jeff

- Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.


# 166152 20-Jan-2007 jeff

- We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
- If the thread is already bound in sched_bind() unbind it before
re-binding it to a new cpu. I don't like these semantics but they are
expected by some code in the tree. Patch by jkoshy.


# 166137 20-Jan-2007 jeff

- In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings. If NEEDRESCHED is set and an ipi is later delivered
it will clear it rather than cause extra context switches. However, if
we miss setting it we can have terrible latency.
- In sched_bind() correctly implement bind. Also be slightly more
tolerant of code which calls bind multiple times. However, we don't
change binding if another call is made with a different cpu. This
does not presently work with hwpmc which I believe should be changed.


# 166108 19-Jan-2007 jeff

Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues. This added
a lot of complexity with questionable gain. It's easy enough to
reimplement if it's shown to help on huge machines.
- Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
sched_add() so we have selectable cpu choosers and simplify the logic
a bit here.
- Implement tdq_pickpri() as the new default cpu chooser. This algorithm
is similar to Solaris in that it tries to always run the threads with
the best priorities. It is actually slightly more complex than
solaris's algorithm because we also tend to favor the local cpu over
other cpus which has a boost in latency but also potentially enables
cache sharing between the waking thread and the woken thread.
- Add a bunch of tunables that can be used to measure effects of different
load balancing strategies. Most of these will go away once the
algorithm is more definite.
- Add a new mechanism to steal threads from busy cpus when we idle. This
is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The
threshold is the required length of a tdq's run queue before another
cpu will be able to steal runnable threads. This prevents most queue
imbalances that contribute to the long latencies.


# 165830 06-Jan-2007 jeff

- Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0. For some
reason this bug only shows up on my laptop and not my testboxes.
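
roundup() divides by its second argument, so any divisor derived from
SCHED_TICK_TOTAL() faults if that total can reach 0. A hedged sketch of the
macro shapes involved (the roundup() definition is FreeBSD's <sys/param.h>
one; the clamp and the field names are assumptions):

    /* From <sys/param.h>: the division faults when y == 0. */
    #define roundup(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))

    #define max(a, b)       (((a) > (b)) ? (a) : (b))

    /*
     * Assumed sketch of the fix: clamp the sampled tick window to at
     * least hz so divisors derived from it can never be zero.
     */
    #define SCHED_TICK_TOTAL(ts) \
            (max((ts)->ts_ltick - (ts)->ts_ftick, hz))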


# 165827 06-Jan-2007 jeff

- Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI(). This prevents
cases where rounding down would allow the quotient to exceed
SCHED_PRI_RANGE.
- Garbage collect some unused flags and fields.
- Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
duplicated this functionality.
- Re-enable the rebalancer by default and fix the sysctl so it can be
modified.


# 165821 06-Jan-2007 jeff

- Don't IPI unless we're going to interrupt something exiting in the kernel;
otherwise we can afford the latency. This makes a significant performance
improvement.


# 165819 05-Jan-2007 jeff

- Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
- Change sched_interact_update() to fix cases where the stored history
has expanded significantly rather than handling them in the callers. This
fixes a case where sched_priority() could compute a bad value.
- Add a sysctl to disable the global load balancer for experimentation.


# 165796 05-Jan-2007 jeff

- ftick was initialized to -1 for init and any of its children. Fix this by
setting ftick = ltick = ticks in schedinit().
- Update the priority when we are pulled off of the run queue and when we
are inserted onto the run queue so that it more accurately reflects our
present status. This is important for efficient priority propagation
functioning.
- Move the frequency test into sched_pctcpu_update() so we don't repeat it
each time we'd like to call it.
- Put some temporary work-around code in sched_priority() in case the tick
mechanism produces a bad priority. Eventually this should revert to an
assert again.


# 165766 04-Jan-2007 jeff

- Only allow the tdq_idx to increase by one each tick rather than up to
the most recently chosen index. This significantly improves nice
behavior. This allows a lower priority thread to run some multiple of
times before the higher priority thread makes it to the front of
the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running
with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice
difference of 1 makes a 4% difference in cpu usage between two hogs.
- Track a separate insert and removal index. When the removal index is
empty it is updated to point at the current insert index.
- Don't remove and re-add a thread to the runq when it is being adjusted
down in priority.
- Pull some conditional code out of sched_tick(). It's looking a bit
large now.


# 165762 04-Jan-2007 jeff

ULE 2.0:
- Remove the double queue mechanism for timeshare threads. It was slow
due to excess cache lines in play, caused suboptimal scheduling behavior
with niced and other non-interactive processes, complicated priority
lending, etc.
- Use a circular queue with a floating starting index for timeshare threads.
Enforces fairness by moving the insertion point closer to threads with
worse priorities over time (see the sketch after this entry).
- Give interactive timeshare threads real-time user-space priorities and
place them on the realtime/ithd queue.
- Select non-interactive timeshare thread priorities based on their cpu
utilization over the last 10 seconds combined with the nice value. This
gives us more sane priorities and behavior in a loaded system as
compared to the old method of using the interactivity score. The
interactive score quickly hit a ceiling if threads were non-interactive
and penalized new hog threads.
- Use one slice size for all threads. The slice is not currently
dynamically set to adjust scheduling behavior of different threads.
- Add some new sysctls for scheduling parameters.

Bug fixes/Clean up:
- Fix zeroing of td_sched after initialization in sched_fork_thread() caused
by recent ksegrp removal.
- Fix KSE interactivity issues related to frequent forking and exiting of
kse threads. We simply disable the penalty for thread creation and exit
for kse threads.
- Cleanup the cpu estimator by using tickincr here as well. Keep ticks and
ltick/ftick in the same frequency. Previously ticks were stathz and
others were hz.
- Lots of new and updated comments.
- Many many others.

Tested on: up x86/amd64, 8way amd64.
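
A standalone sketch of the circular timeshare queue described above (the
queue count matches the kernel's RQ_NQS, but the variable and function
names are assumed):

    #define RQ_NQS  64          /* number of run queues */

    static int tdq_idx;         /* insertion head, advanced over time */
    static int tdq_ridx;        /* removal index, trailing the head */

    /*
     * Map a timeshare priority onto the circular queue relative to
     * the current head.  Threads with worse (numerically higher)
     * priorities land farther from the head, but as the head advances
     * they drift toward it and must eventually be chosen: fairness by
     * rotation rather than by recomputing slices.  Removal starts at
     * tdq_ridx and catches up to tdq_idx as queues empty.
     */
    static int
    timeshare_idx(int pri)
    {
        return ((tdq_idx + pri) % RQ_NQS);
    }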


# 165627 29-Dec-2006 jeff

- More search and replace prettying.


# 165620 29-Dec-2006 jeff

- Clean up a bit after the most recent KSE restructuring.


# 164939 06-Dec-2006 julian

Changes to try to fix sched_ule.c, courtesy of David Xu.


# 164936 06-Dec-2006 julian

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 164091 08-Nov-2006 maxim

o Fix a couple of obvious typos.


# 163709 26-Oct-2006 jb

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 161599 25-Aug-2006 davidxu

Add user priority loaning code to support priority propagation for
1:1 threading's POSIX priority mutexes; the code is a no-op unless
the priority-aware umtx code is committed.


# 159630 15-Jun-2006 davidxu

Add the scheduler API sched_relinquish(); it is used to implement the
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish the cpu; the ULE and CORE schedulers have two internal
run-queues, and a timesharing thread which calls the yield() syscall
should be moved to the inactive queue.


# 159570 13-Jun-2006 davidxu

Add scheduler CORE, the work I did half a year ago and recently
picked up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different from ULE's; it comes from the Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use the same word
"score" as a priority boost, as in the ULE scheduler.

Briefly, the scheduler has the following characteristics:
1. A timesharing process's nice value is seriously respected;
the timeslice and the interaction-detecting algorithm are based
on the nice value.
2. Per-cpu scheduling queues and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in the wakeup path.
5. Support for POSIX SCHED_FIFO and SCHED_RR.
Unlike the 4BSD and ULE schedulers, which use a fuzzy RQ_PPQ, this
scheduler uses 256 priority queues. Unlike ULE, which uses both pull
and push, this scheduler uses only the pull method; the main reason is
to let a relatively idle cpu do the work. But currently the whole
scheduler is protected by the big sched_lock, so the benefit is not
visible; it really can be worse than nothing, because all other cpus
are locked out while we are doing balancing work, a problem the 4BSD
scheduler does not have. The scheduler does not support hyperthreading
very well; in fact, it does not distinguish between physical and
logical CPUs, which should be improved in the future. The scheduler
has a priority inversion problem on MP machines, so it is not good for
realtime scheduling and can cause realtime process starvation.
Despite that, MySQL super-smack seems to run better on my
Pentium-D machine when using libthr, on both UP and SMP kernels.


# 159337 06-Jun-2006 davidxu

Make ke_rqindex unsigned.


# 153749 27-Dec-2005 davidxu

Use variable i instead of variable cpus as an index to get the correct kseq.


# 153533 19-Dec-2005 davidxu

Fix a bug in the slice calculation code: the current code uses hz, but
sched_clock() is driven by the stat clock.

Submitted by: taku at tackymt dot homeip dot net


# 150443 21-Sep-2005 davidxu

This is a forced commit. The previous change, revision 1.158, was intended
to fix PR kern/86067.


# 150442 21-Sep-2005 davidxu

Temporarily disable the nice threshold detection code, as it can starve
a thread holding a critical resource, e.g. a mutex or other implicit
synchronization flags. Give a thread which exceeds the nice threshold a
minimum time slice.

PR: kern/86087


# 149278 19-Aug-2005 davidxu

Move up the code for testing KEF_HOLD to avoid ke_cpu being changed unexpectedly
for PRI_ITHD and PRI_REALTIME threads.


# 148856 08-Aug-2005 davidxu

Try our best to keep a preempted thread at the front of the run queue;
this seems to improve performance a bit for some workloads, but we still
see interactive lag unless the cpu idling race is fixed.


# 148603 31-Jul-2005 davidxu

If a thread was removed from the system run queue, kse_assign shouldn't
add it again.


# 148383 25-Jul-2005 delphij

Cast to uintptr_t when the compiler complains. This fixes the ULE
scheduler breakage that accompanied the recent atomic_ptr() change.


# 147565 23-Jun-2005 peter

Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by: re


# 147068 07-Jun-2005 jeff

- Fix the case where we're not preempting but there is already a newtd
as this happens via thread_switchout(). I don't particularly like the
structure of the code here. We twice call out to thread code when
a thread is voluntarily switching. Once to thread_switchout() and once
to slot_fill(), while sched_4BSD does even more work which is redundant
to select another thread to use our remaining slice. This should be
simplified in the future, but for now I'm only going to fix the bug not
the bad design.


# 146955 04-Jun-2005 jeff

- It's 2005 already, I've been working on this for three years.


# 146954 04-Jun-2005 jeff

- Don't SLOT_USE() in the preempt case, sched_add() has already taken the
slot for us. Previously, we would take two slots on every preempt, and
setrunqueue() would fix it up for us in the non threaded case. The
threaded case was simply broken.
- Clean up flags, prototypes, comments.


# 145256 19-Apr-2005 jkoshy

Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by: alc, jhb (kernel changes)


# 144777 08-Apr-2005 ups

Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by: jhb


# 142273 22-Feb-2005 jeff

- A test in sched_switch() is no longer necessary and it is incorrect
when td0 is preempted before it voluntarily switches.

Discovered by: Arjan Van Leeuwen <avleeuwen@gmail.com>


# 141292 04-Feb-2005 jeff

- Add ke_runq == NULL to the conditions which will cause us to abort
adjusting timeshare loads in sched_class(). This is only important if
the thread has never run, otherwise the state checks should work as
expected.


# 139455 30-Dec-2004 jhb

Fix a typo and two whitespace nits.


# 139453 30-Dec-2004 jhb

Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides whether the thread
can go back to normal mode (if its normal priority is high enough to
satisfy the pending lock requests) or whether it should continue to use the
priority specified in the sched_unlend_prio() call (see the sketch after
this entry). This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another thread's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.

Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.

Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
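
A hedged sketch of the unlend decision described above; the function names
are from this commit, but the body is simplified and the stub types are for
illustration only (lower numeric value means higher priority):

    struct thread {                     /* stub fields */
        unsigned char   td_base_pri;
        int             td_flags;
    };
    #define TDF_BORROWING   0x00000001  /* flag value assumed */

    void    sched_prio(struct thread *, unsigned char);
    void    sched_lend_prio(struct thread *, unsigned char);

    /*
     * 'prio' is the minimum priority the turnstile code says the
     * thread must keep to satisfy the threads still blocked on locks
     * it owns.
     */
    void
    sched_unlend_prio(struct thread *td, unsigned char prio)
    {
        unsigned char base_pri = td->td_base_pri;

        if (base_pri <= prio) {
            /* Normal priority already satisfies every waiter. */
            td->td_flags &= ~TDF_BORROWING;
            sched_prio(td, base_pri);
        } else {
            /* Keep borrowing, at the minimum required priority. */
            sched_lend_prio(td, prio);
        }
    }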


# 139336 26-Dec-2004 jeff

- Unintentionally checked in a debugging panic. Remove that.


# 139334 26-Dec-2004 jeff

- Fix a long standing problem where an ithread would not honor sched_pin().
- Remove the sched_add wrapper that used sched_add_internal() as a backend.
Its only purpose was to interpret one flag and turn it into an int. Do
the right thing and interpret the flag in sched_add() instead.
- Pass the flag argument to sched_add() to kseq_runq_add() so that we can
get the SRQ_PREEMPT optimization too.
- Add a KEF_INTERNAL flag. If KEF_INTERNAL is set we don't adjust the SLOT
counts, otherwise the slot counts are adjusted as soon as we enter
sched_add() or sched_rem() rather than when the thread is actually placed
on the run queue. This greatly simplifies the handling of slots.
- Remove the explicit prevention of migration for ithreads on non-x86
platforms. This was never shown to have any real benefit.
- Remove the unused class argument to KSE_CAN_MIGRATE().
- Add ktr points for thread migration events.
- Fix a long standing bug on platforms which don't initialize the cpu
topology. The ksg_maxid variable was never correctly set on these
platforms which caused the long term load balancer to never inspect
more than the first group or processor.
- Fix another bug which prevented the long term load balancer from working
properly. If stathz != hz we can't expect sched_clock() to be called
on the exact tick count that we're anticipating.
- Rearrange sched_switch() a bit to reduce indentation levels.


# 139316 25-Dec-2004 jeff

- Remove earlier KTR_ULE tracepoints.
- Define new KTR_SCHED points so that we can graph the operation of the
scheduler.


# 138843 14-Dec-2004 jeff

- Garbage collect several unused members of struct kse and struct ksegrp.
As best as I can tell, some of these were never used.


# 138842 14-Dec-2004 jeff

- In kseq_choose(), don't recalculate slice values for processes with a
nice of 0. Doing so can cause an infinite loop because they should be
running, but a nice -20 process could prevent them from doing so.
- Add a new flag KEF_PRIOELEV to flag a thread that has had its priority
elevated due to priority propagation. If a thread has had its priority
elevated, we assume that it must go on the current queue and it must
get a slice.
- In sched_userret() if our priority was elevated and we shouldn't have
a timeslice, yield here until we should.

Found/Tested by: glebius


# 138802 13-Dec-2004 jeff

- Take up a 'slot' while we're on the assigned queue, waiting to be
posted to another processor. Otherwise, kern_switch() gets confused
and tries to sched_add(NULL).


# 137588 11-Nov-2004 jeff

- Temporarily disable the nice -20 throttling code. It has some interaction
with APM that I do not understand yet.

Reported & Tested by: glebius


# 137067 30-Oct-2004 jeff

- When choosing a thread on the run queue, check to see if its nice is
outside of the nice threshold due to a recently awoken thread with a
lower nice value. This further reduces the amount of time a positively
niced thread gets while running in conjunction with a workload that has
many short sleeps (ie buildworld).


# 137061 30-Oct-2004 jeff

- In sched_prio() check to see if the kse is assigned to a runq as the
check for TD_ON_RUNQ() no longer means the thread is really on a run-
queue. I suspect this state should be re-evaluated as it must mean
something else now. This fixes ULE+KSE+PREEMPTION on UP x86.


# 136173 05-Oct-2004 julian

Fix whitespace botch that only showed up in the commit message diff :-/

MFC after: 4 days


# 136170 05-Oct-2004 julian

When preempting a thread, put it back on the HEAD of its run queue.
(Only really implemented in 4bsd)

MFC after: 4 days


# 136169 05-Oct-2004 julian

Oops. left out part of the diff.

MFC after: 4 days


# 136167 05-Oct-2004 julian

Use some macros to track available scheduler slots to allow
easier debugging.

MFC after: 4 days


# 135295 16-Sep-2004 julian

clean up thread runq accounting a bit.

MFC after: 3 days


# 135076 11-Sep-2004 scottl

Revert the previous round of changes to td_pinned. The scheduler isn't
fully initialized when the pmap layer tries to call sched_pin() early in
boot, which results in a quick panic. Use ke_pinned instead, as was originally
done with Tor's patch.

Approved by: julian


# 135059 10-Sep-2004 julian

Try committing from the right tree this time
MFC after: 2 days


# 135056 10-Sep-2004 julian

Make up my mind if cpu pinning is stored in the thread structure or the
scheduler specific extension to it. Put it in the extension as
the implementation details of how the pinning is done needn't be visible
outside the scheduler.

Submitted by: tegge (of course!) (with changes)
MFC after: 3 days


# 135051 10-Sep-2004 julian

Add some code to allow threads to nominate a sibling to run if they are going to sleep.

MFC after: 1 week


# 134791 05-Sep-2004 julian

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time, so that a ksegrp with
10000 threads can not swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually
develop its own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
libkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 134649 02-Sep-2004 scottl

Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c). This
is a possible MT5 candidate.


# 134586 01-Sep-2004 julian

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# 134415 27-Aug-2004 peter

Commit Jeff's suggested changes for avoiding a bug that is exposed by
preemption and/or the rev 1.79 kern_switch.c change that was backed out.

The thread was being assigned to a runq without adding in the load, which
would cause the counter to hit -1.


# 133555 12-Aug-2004 jeff

- Introduce a new flag KEF_HOLD that prevents sched_add() from doing a
migration. Use this in sched_prio() and sched_switch() to stop us from
migrating threads that are in short term sleeps or are runnable. These
extra migrations were added in the patches to support KSE.
- Only set NEEDRESCHED if the thread we're adding in sched_add() is a
lower priority and is being placed on the current queue.
- Fix some minor whitespace problems.


# 133427 10-Aug-2004 jeff

- Use a new flag, KEF_XFERABLE, to record with certainty that this kse had
contributed to the transferable load count. This prevents any potential
problems with sched_pin() being used around calls to setrunqueue().
- Change the sched_add() load balancing algorithm to try to migrate on
wakeup. This attempts to place threads that communicate with each other
on the same CPU.
- Don't clear the idle counts in kseq_transfer(), let the cpus do that when
they call sched_add() from kseq_assign().
- Correct a few out of date comments.
- Make sure the ke_cpu field is correct when we preempt.
- Call kseq_assign() from sched_clock() to catch any assignments that were
done without IPI. Presently all assignments are done with an IPI, but I'm
trying a patch that limits that.
- Don't migrate a thread if it is still runnable in sched_add(). Previously,
this could only happen for KSE threads, but due to changes to
sched_switch() all threads went through this path.
- Remove some code that was added with preemption but is not necessary.


# 132776 28-Jul-2004 kan

Avoid casts as lvalues.


# 132589 23-Jul-2004 scottl

Clean up whitespace, increase consistency and correctness.

Submitted by: bde


# 132372 18-Jul-2004 julian

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theoretically this would lead to runtime being allocated to the wrong
entity in some cases, though it is not clear how often this actually happened.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 132266 16-Jul-2004 jhb

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# 131929 10-Jul-2004 marcel

Update for the KDB framework:
o Call kdb_backtrace() instead of backtrace().


# 131839 08-Jul-2004 jhb

- Move contents of sched_add() into a sched_add_internal() function that
takes an argument to specify if it should preempt or not. Don't preempt
when sched_add_internal() is called from kseq_idled() or kseq_assign()
as in those cases we are about to call mi_switch() anyways. Also, doing
so during the first context switch on an AP leads to a NULL pointer deref
because curthread is NULL.
- Reenable preemption for ULE.

Submitted by: Taku YAMAMOTO taku at tackymt.homeip.net


# 131678 06-Jul-2004 rwatson

Temporarily disable preemption in SCHED_ULE due to reported panics and
hangs due to recent preemption changes. This change appears to remove
the panic that I was running into, but at the cost of increasing
ithread scheduling latency, and as such is a temporary band-aid until
jhb has a chance to resolve the ule<->preemption interaction that is
the source of the problem. If it doesn't fix the problem for others--
sorry!


# 131527 03-Jul-2004 phk

Add NULL arg to mi_switch() call to stop kernel compiles from breaking.


# 131510 02-Jul-2004 bmilekic

Fix SCHED_ULE build on SMP. The previous revision (1.110)
introduced a KSE_CAN_MIGRATE() invocation with one argument
missing (class). Either this is a genuine oversight or it crept
in from JHB's repo where he may have modified it. If it's
the latter then it may require more attention. For now fix
the make depend.


# 131481 02-Jul-2004 jhb

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption
(see the sketch after this entry).
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)
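
A hedged standalone sketch of the maybe_preempt() logic described in the
first item above; the stub types, flag values, and SW_INVOL value are
illustrative assumptions (lower numeric priority is better):

    struct thread {
        int     td_priority;    /* lower value: higher priority */
        int     td_critnest;    /* critical section nesting depth */
        int     td_flags;
    };
    #define TDF_OWEPREEMPT  0x00000002  /* flag value assumed */
    #define SW_INVOL        2           /* value assumed */

    extern struct thread *curthread;
    void    mi_switch(int flags, struct thread *newtd);

    static int
    maybe_preempt(struct thread *td)
    {
        struct thread *ctd = curthread;

        if (td->td_priority >= ctd->td_priority)
            return (0);         /* not better: just queue it */
        if (ctd->td_critnest > 1) {
            /* Nested critical section: defer the preemption. */
            ctd->td_flags |= TDF_OWEPREEMPT;
            return (0);
        }
        mi_switch(SW_INVOL, td);        /* switch directly, skip the runq */
        return (1);
    }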


# 131473 02-Jul-2004 jhb

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to switch to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# 130881 21-Jun-2004 scottl

Add the sysctl node 'kern.sched.name' that has the name of the scheduler
currently in use. Move the 4bsd kern.quantum node to kern.sched.quantum
for consistency.


# 130551 15-Jun-2004 julian

Nice is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 129982 02-Jun-2004 jeff

- Run sched_balance() and sched_balance_groups() from hardclock via
sched_clock() rather than using callouts. This means we no longer have to
take the load of the callout thread into consideration while balancing and
should make the balancing decisions simpler and more accurate.

Tested on: x86/UP, amd64/SMP


# 128563 22-Apr-2004 obrien

There was a thread on "unusually high load averages" when running under
sched_ule, in January 2004. Looking at this, "pagezero" is (one of) the
culprit(s). We had no provision for processes with P_NOLOAD set. With
pagezero not running at PRI_ITHD, kseq_load_{add,rem} count pagezero as
another-normal-process, thus the "expected-plus-one" load reported in
the above thread.

Submitted by: Nikos Ntarmos <ntarmos@ceid.upatras.gr>


# 128055 09-Apr-2004 cognet

Spell "switches" a more conventional way.


# 127850 04-Apr-2004 jeff

- Use the proper constant in sched_interact_update(). Previously,
SCHED_INTERACT_MAX was used where SCHED_SLP_RUN_MAX was needed. This was
causing the interactivity scaler to lose history at a more dramatic rate
than intended.


# 127498 27-Mar-2004 marcel

Change the type of the various CPU masks to cpumask_t. Note that as
long as there are still explicit uses of int, whether in types or
in function names (such as atomic_set_int() in sched_ule.c), we can
not change cpumask_t to be anything other than u_int. See also the
commit log for sys/sys/types.h, revision 1.84.


# 127278 21-Mar-2004 obrien

Give a more reasonable CPU time to the threads which are using scheduler
activation (i.e., applications are using libpthread). This is because
SCHED_ULE sometimes puts P_SA processes into ksq_next unnecessarily.
Which doesn't give fair amount of CPU time to processes which are
using scheduler-activation-based threads when other (semi-)CPU-intensive,
non-P_SA processes are running.

Further work will no doubt be done by jeffr at a later date.

Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Reviewed by: rwatson, freebsd-current@


# 126326 27-Feb-2004 jhb

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep queues in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel (see the sketch
after this entry).
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleeps no longer inherently bump the priority. Instead, this is solely
a property of msleep(), which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
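
A standalone sketch of that shape (the table size, hash function, and stub
thread type are assumptions; buckets chain queue heads, and each head lists
the threads asleep on one wait channel):

    #include <sys/queue.h>
    #include <stdint.h>

    #define SC_TABLESIZE    128         /* size assumed */

    struct thread_stub {
        TAILQ_ENTRY(thread_stub) td_slpq;
    };

    struct sleepqueue {
        LIST_ENTRY(sleepqueue)    sq_hash;      /* bucket chain */
        TAILQ_HEAD(, thread_stub) sq_blocked;   /* sleeping threads */
        const void               *sq_wchan;     /* wait channel */
    };

    LIST_HEAD(sq_chain, sleepqueue) sc_chains[SC_TABLESIZE];

    #define SC_HASH(wc)     ((((uintptr_t)(wc)) >> 8) % SC_TABLESIZE)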


# 125299 01-Feb-2004 jeff

- Allow interactive tasks to use the maximum time-slice. This is not as
detrimental as I thought it would be in the case of massive process
storms from a shell and it makes regular desktop usage noticeably
better.


# 125289 01-Feb-2004 jeff

- Add a new member to struct kseq called ksq_sysload. This is intended to
track the load for the sched_load() function. In the SMP case this member
is not defined because it would be redundant with the ksg_load member
which already tracks the non ithd load.
- For sched_load() in the UP case simply return ksq_sysload. In the SMP
case traverse the list of kseq groups and sum up their ksg_load fields.


# 124959 25-Jan-2004 jeff

- sched_strict has been dead for a long time now. Get rid of it.


# 124958 25-Jan-2004 jeff

- Clean up KASSERTS.


# 124944 25-Jan-2004 jeff

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and properly
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and
remove direct references to the process statistics.


# 123694 20-Dec-2003 jeff

- Make our transfer decisions based on load and not transferable load. A
cpu could have been bogged down with non-transferable load and still not
migrated a new thread to an idle cpu. This required some benchmarking and
tuning to get right as the comment above it suggests.


# 123693 20-Dec-2003 jeff

- Enable ithread migration on x86. This is done to work around a bug in the
IO APIC on Xeons that prevents round-robin interrupt assignment from
working.


# 123685 20-Dec-2003 jeff

- In kseq_transfer() return if smp has not been started.
- In sched_add(), do the idle check prior to the transfer check so that we
don't try to transfer load from an idle cpu. This fixes panics caused by
IPIs on UP machines running SMP kernels.

Reported/Debugged by: seanc


# 123684 20-Dec-2003 jeff

- Running interactive tasks with the minimum time-slice is fine for vi and
sh, but not so great for mozilla, X, etc. Add a fixed define for the slice
size granted to interactive KSEs.


# 123529 14-Dec-2003 jeff

- Assign the ke_cpu field in kseq_notify() so that all of our callers do not
have to do it.
- Set the ke_runq to NULL in sched_add() before calling kseq_notify().
Otherwise we may panic in sched_add() if INVARIANTS is on.


# 123487 12-Dec-2003 jeff

- Now that we have kseq groups, balance them separately.
- The new sched_balance_groups() function does intra-group balancing while
sched_balance() balances the available groups.
- Pick a random time between 0 ticks and hz * 2 ticks to restart each
balancing process. Each balancer has its own timeout.
- Pick a random place in the list of groups to start the search for lowest
and highest group loads. This prevents us from preferring a group based on
numeric position.
- Use a nasty hack to stop us from preferring cpu 0. The problem is that
softclock always runs on cpu 0, so it always has a little extra load. We
ignore this load in the balancer for now. In the future softclock should
run on a random cpu and these hacks can go away.


# 123435 11-Dec-2003 jeff

- Don't let the pctcpu rate limiter throttle us if we have recorded over
SCHED_CPU_TICKS ticks. This was allowing processes to display
(1/SCHED_CPU_TIME * 100) % more cpu than they had used.


# 123434 11-Dec-2003 jeff

- In sched_switch(), if a thread has been assigned, don't touch the runqueues
or load. These things have already been taken care of in sched_bind()
which should be the only place that we're switching in an assigned thread.


# 123433 11-Dec-2003 jeff

- Add support for CPU groups to ule. All SMT cores on the same physical
cpu are added to a group.
- Don't place a cpu into the kseq_idle bitmask until all cpus in that group
have idled.
- Prefer idle groups over idle group members in the new kseq_transfer()
function. In this way we will prefer to balance load across full cores
rather than add further load to a partial core.
- Before a cpu goes idle, check the other group members for threads. Since
SMT cpus may freely share threads, this is cheap.
- SMT cores may be individually pinned and bound to now. This contrasts the
old mechanism where binding or pinning would have allowed a thread to run
on any available cpu.
- Remove some unnecessary logic from sched_switch(). Priority propagation
should be properly taken care of in sched_prio() now.


# 123231 07-Dec-2003 peter

rqb_bits[] may be an int64_t (eg: on alpha, and recently on amd64).
Be sure to shift (long)1 << 33 and higher, not (int)1. Otherwise bad
things happen(TM). This is why beast.freebsd.org paniced with ULE.

Reviewed by: jeff
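
A minimal standalone illustration:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int64_t bits;

        /* bits = 1 << 33; -- undefined: the shift happens in 32-bit int */
        bits = (long)1 << 33;   /* widen first: 8589934592 on LP64 */
        printf("%jd\n", (intmax_t)bits);
        return (0);
    }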


# 123126 03-Dec-2003 jhb

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha


# 122848 17-Nov-2003 jeff

- Mark ksq_assigned as volatile so that when this code is used without
sched_lock we can be sure that we'll pick up the new value.


# 122847 17-Nov-2003 jeff

- Remove long dead code. rslices hasn't been used in some time and neither
has sched_pickcpu().


# 122744 15-Nov-2003 jeff

- Introduce kseq_runq_{add,rem}() which are used to insert and remove
kses from the run queues. Also, on SMP, we track the transferable
count here. Threads are transferable only as long as they are on the
run queue.
- Previously, we adjusted our load balancing based on the transferable count
minus the number of actual cpus. This was done to account for the threads
which were likely to be running. All of this logic is simpler now that
transferable accounts for only those threads which can actually be taken.
Updated various places in sched_add() and kseq_balance() to account for
this.
- Rename kseq_{add,rem} to kseq_load_{add,rem} to reflect what they're
really doing. The load is accounted for separately from the runq because
the load is accounted for even as the thread is running.
- Fix a bug in sched_class() where we weren't properly using the PRI_BASE()
version of the kg_pri_class.
- Add a large comment that describes the impact of a seemingly simple
conditional in sched_add().
- Also in sched_add() check the transferable count and KSE_CAN_MIGRATE()
prior to checking kseq_idle. This reduces the frequency of access for
kseq_idle which is a shared resource.


# 122165 06-Nov-2003 jeff

- Somehow I botched my last commit. Add an extra ( to fix things up. I'm
still not sure how this happened.

Reported by: ps


# 122158 06-Nov-2003 jeff

- Remove the local definition of sched_pin and unpin. They are provided in
sched.h now.
- Respect the td pin count.


# 122094 05-Nov-2003 jeff

- It's ok if sched_runnable() has races in it, we don't need the sched_lock
here unless we have something on the assigned queue.


# 122038 04-Nov-2003 jeff

- Add initial support for pinning and binding.


# 121923 03-Nov-2003 jeff

- Remove kseq_find(), we no longer scan other cpu's run queues when we go
idle. They figure out that we're idle fast enough that the cache pollution
introduced by scanning their run queue is more expensive than waiting
a little longer.
- Add kseq_setidle() to mark us as being idle. Use this in place of
kseq_find().
- Remove kseq_load_highest(), kseq_find() was the only consumer of this
interface. kseq_balance() has its own customized version that finds the
lowest and highest loads simultaneously.

Continuously told that this would be faster by: terry


# 121896 02-Nov-2003 jeff

- Remove the ksq_loads[] array. We are only interested in three counts,
the total load, the timeshare load, and the number of threads that can
be migrated to another cpu. Account for these separately.
- Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
can be migrated to another CPU. Currently, this only checks to see if
we're an interrupt handler. Eventually this will also be used to support
CPU binding.


# 121872 02-Nov-2003 jeff

- In sched_prio() only force us onto the current queue if our priority is
being elevated (numerically smaller).


# 121871 02-Nov-2003 jeff

- Rename SCHED_PRI_NTHRESH to SCHED_SLICE_NTHRESH since it is only used in
slice assignment. Add a comment describing what it does.
- Remove a stale XXX comment; nice should not impact the interactivity, as
nice adjustments only affect non-interactive tasks in ULE.
- Don't allow nice -20 tasks to totally starve nice 0 tasks. Give them at
least SCHED_SLICE_MIN ticks. We still allow nice 0 tasks to starve nice
+20 tasks as intended.


# 121869 02-Nov-2003 jeff

- Remove uses of PRIO_TOTAL and replace them with SCHED_PRI_NRESV
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
do not have to account for it in the few places that we use it.

Requested by: bde


# 121868 02-Nov-2003 jeff

- Change sched_interact_update() to only accept slp+runtime values between
0 and SCHED_SLP_RUN_MAX * 2. This allows us to simplify the algorithm
quite a bit. Before, it dealt with arbitrary values which required us
to do nasty integer division tricks that didn't quite work out correctly.
- Change sched_wakeup() to detect conditions where the slp+runtime could
exceed SCHED_SLP_RUN_MAX * 2. This can happen if we go to sleep for
longer than 6 seconds. In this case, we'll just clear the runtime and
set the sleep time to the max.
- Define a new function, sched_interact_fork(), which updates the slp+runtime
of a newly forked thread (sketched after this entry). We want to limit the
amount of history retained
from the parent so that we learn the child's behavior quickly. We don't,
however want to decay it to nothing. Previously, we would simply divide
each parameter by 100 whenever we forked. After a few forks the values
would reach 0 and tasks would not be considered interactive.
- Add another KTR entry, cleanup some existing entries.
- Remove a useless sched_interact_update() from sched_priority(). This is
already done by the callers that require it.
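
A standalone sketch of the sched_interact_fork() decay described above (the
cap constant and its value are assumptions): instead of dividing by 100,
scale both values down by the same ratio, which preserves their
proportion -- and hence the interactivity score -- while shrinking the
inherited history.

    #define SLP_RUN_FORK    (2 * 1000000)       /* cap, value assumed */

    static void
    interact_fork(int *runtime, int *slptime)
    {
        int ratio, sum;

        sum = *runtime + *slptime;
        if (sum > SLP_RUN_FORK) {
            ratio = sum / SLP_RUN_FORK;
            *runtime /= ratio;
            *slptime /= ratio;
        }
    }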


# 121790 31-Oct-2003 jeff

- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor (see the sketch after
this entry). Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appended to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose its last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
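
A standalone C11 sketch of the lockless assign list described above (the
kernel used its own atomic primitives under sched_lock; the names here are
assumed):

    #include <stdatomic.h>
    #include <stddef.h>

    struct kse_stub {
        struct kse_stub *ke_assign;     /* next element in the list */
    };

    static _Atomic(struct kse_stub *) ksq_assigned;

    /* kseq_notify() side: push; returns 1 if the list was empty. */
    static int
    assign_push(struct kse_stub *ke)
    {
        struct kse_stub *head = atomic_load(&ksq_assigned);

        do {
            ke->ke_assign = head;
        } while (!atomic_compare_exchange_weak(&ksq_assigned, &head, ke));
        return (head == NULL);          /* first entry: send the IPI */
    }

    /* kseq_assign() side: detach the whole list at once, then walk it. */
    static struct kse_stub *
    assign_takeall(void)
    {
        return (atomic_exchange(&ksq_assigned, NULL));
    }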


# 121682 29-Oct-2003 bde

Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion must match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.


# 121625 28-Oct-2003 jeff

- Only change the run queue in sched_prio() if the kse is non null. threads
can be in the TD_ON_RUNQ state and not have an associated kse.
- Remove the PRI_IDLE special case from sched_clock(), it was not actually
necessary.


# 121605 27-Oct-2003 jeff

- Use a better algorithm in sched_pctcpu_update()

Contributed by: Thomaswuerfl@gmx.de

- In sched_prio(), adjust the run queue for threads which may need to move
to the current queue due to priority propagation.
- In sched_switch(), fix style bug introduced when the KSE support went in.
Columns are 80 chars wide, not 90.
- In sched_switch(), fix the comparison in the idle case and explicitly
re-initialize the runq in the not propagated case.
- Remove dead code in sched_clock().
- In sched_clock(), if we're an IDLE class td, set NEEDRESCHED so that threads
that have become runnable will get a chance to.
- In sched_runnable(), if we're not the IDLETD, we should not consider
curthread when examining the load. This mimics the 4BSD behavior of
returning 0 when the only runnable thread is running.
- In sched_userret(), remove the code for setting NEEDRESCHED entirely.
This is not necessary and is not implemented in 4BSD.
- Use the correct comparison in sched_add() when checking to see if an idle
prio task has had its priority temporarily elevated.


# 121290 20-Oct-2003 jeff

- If a thread is not bound to a kse return 0 from sched_pctcpu().

Reported by: pawel.worach@nordea.com


# 121146 16-Oct-2003 jeff

- Only kse_reassign() in the !running case.

Reported by: kris


# 121145 16-Oct-2003 jeff

- Call sched_add() with the correct argument on SMP.

Reported by: Valentin Chopov <valentin@valcho.net>


# 121132 16-Oct-2003 jeff

- Fix a minor problem with my last commit, we don't want to return from
sched_switch if the thread is running, we want to fall through and pick
a new thread because we have been preempted.


# 121128 16-Oct-2003 jeff

- Collapse sched_switchin() and sched_switchout() into sched_switch(). Now
mi_switch() calls sched_switch() which calls cpu_switch(). This is
actually one less function call than it had been.


# 121127 16-Oct-2003 jeff

- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.


# 121126 16-Oct-2003 jeff

- The non-iterative algorithm for interact_update was broken due to
rounding errors. This was the source of the majority of the
interactivity problems. Reintroduce the old algorithm and its XXX.
- Up the interactivity threshold to 30. It really could stand to be even
a tiny bit higher.
- Let the sleep and run time accumulate up to 5 seconds of history rather
than two. This helps stop XFree86 from becoming non-interactive during
bursts of activity.


# 121107 15-Oct-2003 jeff

- If our user_pri doesn't match our actual priority, our priority has been
elevated either due to priority propagation or because we're in the
kernel. In either case, put us on the current queue so that we don't
stop others from using important resources. At some point the priority
elevations from sleeping in the kernel should go away.
- Remove an optimization in sched_userret(). Before we would only set
NEEDRESCHED if there was something of a higher priority available. This
is a trivial optimization and it breaks priority propagation because it
doesn't take threads which we may be blocking into account. Notice that
the thread which is blocking others gets up to one tick of cpu time before
we honor this NEEDRESCHED in sched_clock().


# 121051 12-Oct-2003 jeff

- In SCHED_CURR() add holding Giant to the list of criteria that will keep
you on the current queue. In the future, it would be nice if priority
propagation could deterministically pluck a thread off of the next queue
and put it on the current queue. Until then this hack stops us from
holding up our entire current queue, including interrupt handlers, while
a thread on the next queue is blocked while holding Giant.
- Inherit our pctcpu information from our parent.


# 120754 04-Oct-2003 jeff

- Change a lame iterative algorithm to a constant time algorithm. Remove
the XXX that complains about it as well.

Submitted by: ThomasWuerfl@gmx.de


# 120272 20-Sep-2003 jeff

- Somewhere along the line I stupidly removed critical logic from
sched_pctcpu_update(). This caused erroneous cpu times in TOP for
processes that were asleep. Replace the code that was removed.


# 119488 26-Aug-2003 davidxu

Let SA processes work under the ULE scheduler; originally they would panic the kernel.

Reviewed by: jeff


# 119137 19-Aug-2003 sam

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


# 117326 08-Jul-2003 jeff

- When stealing a kse in kseq_move() ignore the current kseq's min nice
value. We want to steal any thread, even one that is not given a slice
on its current queue.


# 117313 07-Jul-2003 jeff

- Clean up an unused variable.

Submitted by: Steve Kargl <sgk@troutmask.apl.washington.edu>


# 117237 04-Jul-2003 jeff

- Parse the cpu topology map in sched_setup().
- Associate logical CPUs on the same physical core with the same kseq.
- Adjust code that assumed there would only be one running thread in any
kseq.
- Wrap the HTT code with a ULE_HTT_EXPERIMENTAL ifdef. This is a start
towards HyperThreading support but it isn't quite there yet.


# 116970 28-Jun-2003 jeff

- Don't migrate to stopped cpus.


# 116962 28-Jun-2003 jeff

- If smp is not started yet don't try to load balance or we'll put threads
on cpus that aren't running yet.


# 116955 28-Jun-2003 jeff

- Throttle the inherited sleep and run time in sched_fork_kseg(). This
allows us to learn the behavior of a thread much more quickly after it
starts up.


# 116946 28-Jun-2003 jeff

- Adjust the default maximum slice value to ~140ms. This has improved the
nice distribution without significantly impacting interactive response.
As a side effect it should also allow batch processes to run for a
slightly longer period which will positively impact their performance.


# 116643 21-Jun-2003 jeff

- lticks was erroneously being updated in sched_pctcpu(). This was causing
us to skip the pctcpu_update() call, which led to inaccurate cpu usage
statistics for processes that didn't run often.


# 116642 21-Jun-2003 jeff

- Don't allow nice to have such a large effect on priority. This was
causing poor interactive performance while unnice processes were running.
The new scheme still allows nice to have an effect on priority but it is
not as dramatic as the effect of the interactivity score.


# 116500 17-Jun-2003 jeff

- Use a more robust mechanism for determining whether or not a kse is on a
kseq.


# 116479 17-Jun-2003 jeff

- Temporarily patch a problem where the interact score could be negative
because the run time exceeds the largest value a signed int can hold.
The real solution involves calculating how far we are over the limit.
To quickly solve this problem we loop removing 1/5th of the current value
until it falls below the limit (sketched below). The common case requires
no passes.
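
A minimal standalone C sketch of the 1/5th reduction described above; the
function name and the limit constant are illustrative stand-ins, not the
kernel's:

    #include <stdio.h>

    /* Hypothetical cap standing in for the real overflow limit. */
    #define RUN_LIMIT (1UL << 30)

    /*
     * Repeatedly shave 1/5th off the accumulated run time until it
     * fits below the limit.  The common case takes zero passes.
     */
    static unsigned long
    throttle_runtime(unsigned long runtime)
    {
            while (runtime > RUN_LIMIT)
                    runtime -= runtime / 5;
            return (runtime);
    }

    int
    main(void)
    {
            printf("%lu\n", throttle_runtime(1UL << 31)); /* a few passes */
            printf("%lu\n", throttle_runtime(1000));      /* untouched */
            return (0);
    }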


# 116463 17-Jun-2003 jeff

- Add a new function "sched_interact_update()" that scales back the sleep
and run time (sketched below).
- Scale the sleep and run time back via sched_interact_update() in more
places. This is to keep the statistic more accurate.
- Charge a parent one tick for forking a child.
- Add only the run time and not the sleep time to the parent's kg when a
thread exits. This allows us to give a penalty for having an expensive
thread exit but does not give a bonus for having an interactive thread
exit.
- Change the SLP_RUN_THROTTLE to limit us to 4/5th and not 1/2.
- Change the SLP_RUN_MAX to two seconds. This keeps bursty interactive
applications like mozilla and openoffice in the interactive range even
through expensive tasks.
- Recalculate the slice after every sleep. This ensures that once a task
has been marked interactive it only has a slice of 1 at the risk of
giving tasks that sleep for a very brief period a longer time slice.
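
A standalone C sketch of the scale-back this commit describes: once the
combined history exceeds SLP_RUN_MAX (two seconds, per the notes above),
both values are throttled to 4/5ths, preserving the sleep:run ratio while
old behavior is slowly forgotten. The struct, field names, and the
microsecond unit are illustrative:

    #include <stdio.h>

    /* Two seconds of history, in microseconds (illustrative unit). */
    #define SLP_RUN_MAX (2UL * 1000000)

    struct interact_hist {
            unsigned long slptime;          /* accumulated sleep time */
            unsigned long runtime;          /* accumulated run time */
    };

    static void
    interact_update(struct interact_hist *h)
    {
            /* Throttle both values to 4/5ths until under the cap. */
            while (h->slptime + h->runtime > SLP_RUN_MAX) {
                    h->slptime = h->slptime * 4 / 5;
                    h->runtime = h->runtime * 4 / 5;
            }
    }

    int
    main(void)
    {
            struct interact_hist h = { 3000000, 1500000 };

            interact_update(&h);
            /* Ratio is still 2:1, total is now under two seconds. */
            printf("slp %lu run %lu\n", h.slptime, h.runtime);
            return (0);
    }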


# 116377 15-Jun-2003 jeff

- Increase the ksegrp's cpu time history buffer to 250ms.
- Decrease the history buffer divisor to 2 so that we remember more of the
old behavior.


# 116368 15-Jun-2003 jeff

- Cap the growth of sleep and run time in sched_exit_kse().


# 116365 15-Jun-2003 jeff

- Fix the maximum slice value. I accidentally checked in a value of '2'
which meant no process would run for longer than 20ms.
- Slightly redo the interactivity scorer. It follows the same algorithm but
in a slightly more correct way. Previously values above half were
incorrect.
- Lower the interactivity threshold to 20. It seems that in testing,
non-interactive tasks are hardly ever near it and expensive interactive
tasks can sometimes surpass it. This area needs more testing.
- Remove an unnecessary KTR.
- Fix a case where an idle thread that had an elevated priority due to
priority prop. would be placed back on the idle queue.
- Delay setting NEEDRESCHED until userret() for threads that had their
priority elevated while in kernel. This gives us the same context switch
optimization as SCHED_4BSD.
- Limit the child's slice to 1 in sched_fork_kse() so we detect its behavior
more quickly.
- Inherit some of the run/slp time from the child in sched_exit_ksegrp().
- Redo some of the priority comparisons so they are more clear.
- Throttle the frequency of sched_pctcpu_update() so that rounding errors
do not make it invalid.


# 116361 14-Jun-2003 davidxu

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 116069 08-Jun-2003 jeff

- Add a simple CPU load balancing algorithm. This works by executing once a
second and equalizing the load between the two most imbalanced CPUs (see
the sketch below). This is intended to clear up long term load imbalances
that would not be handled by the 'pull' method in sched_choose().
- Pull out some bits of sched_choose() into a kseq_move() function that moves
an arbitrary thread from one kseq to another.
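
A standalone C sketch of that push balancer; the load array stands in for
the kernel's per-CPU kseqs, and the decrement/increment stands in for the
kseq_move() call mentioned above:

    #include <stdio.h>

    #define NCPU 4

    /*
     * Find the most and least loaded queues and move one thread from
     * the former to the latter, as the once-a-second balancer does.
     */
    static void
    balance(int load[NCPU])
    {
            int i, high = 0, low = 0;

            for (i = 1; i < NCPU; i++) {
                    if (load[i] > load[high])
                            high = i;
                    if (load[i] < load[low])
                            low = i;
            }
            if (load[high] - load[low] > 1) {       /* worth a move */
                    load[high]--;                   /* kseq_move() stand-in */
                    load[low]++;
            }
    }

    int
    main(void)
    {
            int load[NCPU] = { 5, 1, 2, 2 };

            balance(load);
            printf("%d %d %d %d\n", load[0], load[1], load[2], load[3]);
            return (0);
    }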


# 115998 07-Jun-2003 jeff

- When a new thread is added to a kseq the load is incremented prior to
adding it to the nice tables. Therefore, in kseq_add_nice, we should
keep in mind that the load will be 1 if we are the only thread, and not
0.
- Assert that the sched lock is held in all the appropriate places.
- Increase the scope of the sched lock in sched_pctcpu_update().
- Hold the sched lock in sched_runnable(). It is not held by the caller.


# 114496 02-May-2003 julian

Fix typo in last commit


# 114471 01-May-2003 julian

Move the flag that indicates an idle thread from the KSE to the thread.
It was always referenced via the thread anyhow.

Reviewed by: jhb (a LOOOOONG time ago)


# 113923 23-Apr-2003 jhb

Add lock assertions for various proc/thread/kse/ksegroup fields to the
scheduler functions.


# 113873 22-Apr-2003 jhb

- Assert that the proc lock and sched_lock are held in sched_nice().
- For the 4BSD scheduler, this means that all callers of the static
function resetpriority() now always hold sched_lock, so don't lock
sched_lock explicitly in that function.


# 113865 22-Apr-2003 jhb

Protect p_swtime with the sched_lock.


# 113660 18-Apr-2003 jeff

- Set the ke_cpu field in sched_add() for interrupt and realtime threads
since they are going on the current cpu and not their previously assigned
cpu.
- sched_runnable() should only return true in the SMP case if the other
processor has more than one thread that is runnable. We cannot steal
curthread.
- Change kseq_print() to accept the cpuid instead of a kseq pointer. This
makes use of this function in ddb much easier.


# 113417 12-Apr-2003 jeff

- Unbreak priority prop. for timeshare threads. Always place something on
the current queue if its priority is really elevated. This needs more work
as there are cases where a next queue kse could be holding up what would
be a curr queue kse, and thus hurting interactivity. Also, when a thread
with an elevated priority has its priority lowered it should be placed
back on the next queue.


# 113387 12-Apr-2003 jeff

- Clean up some debug code left over from my earlier megacommit.


# 113386 12-Apr-2003 jeff

- We only care about the base priority. Ignore the SCHED_FIFO_BIT so that
we don't get confused.

Reported and debugged by: Steve Kargl <sgk@troutmask.apl.washington.edu>


# 113372 11-Apr-2003 jeff

- Add sched_exit_*
- Call sched_exit_kse() from sched_exit() instead of implementing it here.


# 113371 11-Apr-2003 jeff

- Only select kseqs with more than one kse to steal. The running kse
is reflected in the load now and you can't very well migrate that.


# 113370 11-Apr-2003 jeff

- When migrating a kse from one kseq to the next actually insert it onto
the second kseq's run queue so that it is referenced by the kse when
it is switched out.
- Spell ksq_rslices properly.

Reported by: Ian Freislich <ianf@za.uu.net>


# 113357 11-Apr-2003 jeff

- Add a SYSCTL node for the ule scheduler.
- Allow user adjustable min and max time slices (suggested by hiten).
- Change the SLP_RUN_MAX to 100ms from 2 seconds so that we learn whether a
process is interactive or not much more quickly.
- Place a process on the current run queue if it is interactive or if it is
running at an interrupt thread priority due to priority prop.
- Use the 'current' timeshare queue for interrupt threads, realtime threads,
and idle threads that are running at higher priority due to priority prop.
This fixes problems where priorities would have been elevated but we would
not check the timeshare run queue until other lower priority tasks were
no longer runnable.
- Keep an array of loads indexed by the priority class as well as a global
load.
- Keep a bucket of nice values with a count of the number of kses currently
runnable with that nice value.
- Keep track of the minimum nice value of any running thread.
- Remove the unused short term sleep accounting. I was attempting to use
this for load balancing but it didn't work out.
- Define a kseq_print() for use with debugging.
- Add KTR debugging at useful places so we can easily debug slice and
priority assignment.
- Decouple the runq assignment from the kseq assignment. kseq_add now keeps
track of statistics. This is done so that the nice and load are still
tracked for the currently running process. Previously, if a niced process
was added while an un-niced process was running, the niced process would
still get a slice, since the code was not aware of the un-niced process.
- Make adjustments for the sched api changes.


# 113339 10-Apr-2003 julian

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but its purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 112994 02-Apr-2003 jeff

- Keep separate statistics and run queues for different scheduling classes.
- Treat each class specially in kseq_{choose,add,rem}. Let the rest of the
code be less aware of scheduling classes.
- Skip the interactivity calculation for non-TIMESHARE ksegrps.
- Move slice and runq selection into kseq_add(). Uninline it now that it's
big.


# 112971 02-Apr-2003 jeff

- Make the interactivity calculator decay faster.
- Make the pcpu estimator update faster.


# 112970 02-Apr-2003 jeff

- I meant divide by two and not shift by two in SCHED_PRI_NHALF.


# 112966 02-Apr-2003 jeff

- Add in support for KSEs with 0 slice values on the run queue. If we try
to select a KSE with a slice of 0 we will update its slice and insert it
onto the next queue.
- Pass the KSE instead of the ksegrp into sched_slice(). This more
accurately reflects the behavior of the code. Slices are granted to kses.
- Add a function kseq_nice_min() which finds the smallest nice value
assigned to the kseg of any KSE on the queue.
- Rewrite the logic in sched_slice(). Add a large comment describing the
new slice selection scheme. To summarize, slices are assigned based on
the nice value. Priorities are still calculated based on the nice and
interactivity of a process. Slice sizes of 0 may be granted for KSEs
whose nice is 20 or further away from the lowest nice on the run queue.
Other nice values are scaled across the range [min, min+20] (sketched
below). This fixes ULE's bad behavior with positively niced processes.
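
A standalone C sketch of that scaling; the slice bounds are placeholder
values, not the scheduler's tunables:

    #include <stdio.h>

    #define SLICE_MAX  100          /* hypothetical max slice, in ms */
    #define SLICE_MIN  10           /* hypothetical min slice, in ms */
    #define NICE_RANGE 20

    /*
     * A KSE's slice shrinks linearly with its nice distance from the
     * lowest nice value on the run queue; 20 or more away earns 0.
     */
    static int
    slice_for(int nice, int min_nice)
    {
            int dist = nice - min_nice;

            if (dist >= NICE_RANGE)
                    return (0);
            if (dist <= 0)
                    return (SLICE_MAX);
            return (SLICE_MAX - (SLICE_MAX - SLICE_MIN) * dist / NICE_RANGE);
    }

    int
    main(void)
    {
            /* With a minimum nice of -5 on the queue: 100, 55, 15, 0. */
            printf("%d %d %d %d\n", slice_for(-5, -5), slice_for(5, -5),
                slice_for(14, -5), slice_for(15, -5));
            return (0);
    }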


# 111857 04-Mar-2003 jeff

- Create a function sched_interact_score() which decides on the
interactivity of a kseg and assigns it a value of 0 through 100 (sketched
below).
- Use sched_interact_score() to determine the dynamic priority.
- Define SCHED_CURR() in terms of sched_interact_score().
- Adjust the maximum slice back down to 100ms.
- Remove redundant clearing of ke_runq in sched_wakeup()
- Clean up #defines and comment them.
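
A plausible standalone reconstruction of such a scorer, not the committed
code; the constants and exact mapping are illustrative. Low scores mean a
mostly-sleeping, interactive kseg; high scores mean a cpu hog:

    #include <stdio.h>

    #define INTERACT_MAX  100
    #define INTERACT_HALF (INTERACT_MAX / 2)

    static int
    interact_score(unsigned long runtime, unsigned long slptime)
    {
            if (runtime > slptime)  /* cpu bound: upper half of range */
                    return (INTERACT_HALF + INTERACT_HALF -
                        slptime * INTERACT_HALF / runtime);
            if (slptime > runtime)  /* mostly sleeping: lower half */
                    return (runtime * INTERACT_HALF / slptime);
            return (INTERACT_HALF); /* perfectly balanced */
    }

    int
    main(void)
    {
            printf("%d\n", interact_score(90, 10));  /* ~95, cpu hog */
            printf("%d\n", interact_score(10, 90));  /* ~5, interactive */
            return (0);
    }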


# 111793 03-Mar-2003 jeff

- Shift the tick count by 10 and back around the sched_pctcpu_update()
calculations. Keep these changes local to the function so the tick count
is in its natural form otherwise. Previously 1000 was added each time
a tick fired and we divided by 1000 when it was reported. This is done
to reduce rounding errors.
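
A standalone C sketch of that fixed-point trick; the window numbers are
made up for the demonstration:

    #include <stdio.h>

    #define FSHIFT 10       /* the 10-bit shift described above */

    int
    main(void)
    {
            unsigned long ticks = 137;      /* ticks used in the window */
            unsigned long window = 250;     /* window length, in ticks */
            unsigned long pct;

            /* Naive division: every fraction below 1.0 rounds to 0. */
            printf("naive: %lu\n", ticks / window);

            /* Shift up before dividing, shift back only on report. */
            pct = (ticks << FSHIFT) / window;
            printf("fixed-point: %lu/1024\n", pct);  /* ~0.548 */
            return (0);
    }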


# 111789 03-Mar-2003 jeff

- In sched_add() special case PRI_TIMESHARE and PRI_ITHD|PRI_REALTIME. We
always place ITHD & REALTIME threads on the current queue of the current
cpu. Prior to this change an interrupt thread would only ever run on one
cpu.


# 111788 03-Mar-2003 jeff

- Refrain from setting the td_priority in sched_wakeup(). It will be reset
before we return to user space.


# 111585 27-Feb-2003 julian

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 111032 17-Feb-2003 julian

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case...
I should have listened to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 110646 10-Feb-2003 jeff

- Enable STRICT_RESCHED until code that dynamically decides on resched
strictness based on the current workload is finished.


# 110645 10-Feb-2003 jeff

- Add a new variable 'kg_runtime' that tracks the amount of time we've run.
- Use the ratio of kg_runtime / kg_slptime to determine our dynamic priority.
- Scale kg_runtime and kg_slptime back when the sum of the two exceeds
SCHED_SLP_RUN_MAX. This allows us to slowly forget old behavior.
- Scale back the runtime and slptime in fork so that the new process has the
same ratio but much less accumulated time. This causes new behavior to be
noticed more quickly.
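
A standalone C sketch of the fork-time scale-back; the divisor of 10 is a
placeholder, as the note above does not specify the factor:

    #include <stdio.h>

    struct kg_hist {
            unsigned long runtime;
            unsigned long slptime;
    };

    /*
     * The child starts with the parent's sleep:run ratio but far less
     * accumulated history, so its own behavior dominates quickly.
     */
    static void
    fork_history(const struct kg_hist *parent, struct kg_hist *child)
    {
            child->runtime = parent->runtime / 10;  /* hypothetical factor */
            child->slptime = parent->slptime / 10;
    }

    int
    main(void)
    {
            struct kg_hist parent = { 1000000, 500000 }, child;

            fork_history(&parent, &child);
            printf("run %lu slp %lu\n", child.runtime, child.slptime);
            return (0);
    }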


# 110267 03-Feb-2003 jeff

- Make some context switches conditional on SCHED_STRICT_RESCHED. This may
have some negative effect on interactivity but it yields great perf. gains.
This also brings the conditions under which ULE context switches in line
with SCHED_4BSD.
- Define some new kseq_* functions for manipulating the run queue.
- Add a new kseq member ksq_rslices and ksq_bload. rslices is the sum of
the slices of runnable kses. This will be used for push load balance
decisions. bload is the number of threads blocked waiting on IO.


# 110260 03-Feb-2003 jeff

- Stop abusing oncpu for our cpu binding. Define a scheduler-local element
in the kse data structure called ke_cpu. This is the cpu which we are
currently bound to. Some flags may be added later to support hard binding.


# 110226 02-Feb-2003 scottl

Use hz if stathz is zero. Adopted from sched_4bsd.
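
A trivial standalone sketch of the fallback, with the kernel's stathz and
hz globals modeled as parameters:

    #include <stdio.h>

    /* stathz may be zero on platforms with no separate stat clock. */
    static int
    effective_stathz(int stathz, int hz)
    {
            return (stathz ? stathz : hz);
    }

    int
    main(void)
    {
            printf("%d\n", effective_stathz(0, 1000));   /* falls back */
            printf("%d\n", effective_stathz(128, 1000)); /* keeps stathz */
            return (0);
    }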


# 110028 29-Jan-2003 jeff

- Use ksq_load as the authoritative count of kses on the pair of kseqs for
sched_runnable() et al.
- Remove some dead code in sched_clock().
- Define two macros KSEQ_SELF() and KSEQ_CPU() for getting the kseq of the
current cpu or some alternate cpu.
- Start introducing kseq_() functions, such as kseq_choose() and kseq_setup().
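
A standalone sketch of accessors in the spirit of KSEQ_SELF()/KSEQ_CPU();
the static array and curcpu variable are stand-ins for the kernel's
per-CPU data:

    #include <stdio.h>

    #define MAXCPU 32

    struct kseq {
            int ksq_load;               /* runnable kses on this cpu */
    };

    static struct kseq kseq_cpu[MAXCPU];
    static int curcpu;                  /* stand-in for the PCPU cpuid */

    #define KSEQ_CPU(x) (&kseq_cpu[(x)])
    #define KSEQ_SELF() KSEQ_CPU(curcpu)

    int
    main(void)
    {
            curcpu = 2;
            KSEQ_SELF()->ksq_load++;
            printf("%d\n", KSEQ_CPU(2)->ksq_load);  /* prints 1 */
            return (0);
    }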


# 110014 28-Jan-2003 jeff

- Remove debugging code that didn't work on UP.


# 109971 28-Jan-2003 jeff

- Allow idle's pctcpu time to be calculated.


# 109970 28-Jan-2003 jeff

- Fix the ksq_load calculation. It now reflects the number of entries on the
run queue for each cpu.
- Introduce kse stealing into the sched_choose() code. This helps balance
cpus better in cases where process turnover is high. This implementation
is fairly trivial and will likely be only a temporary measure until
something more sophisticated has been written.
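
A standalone C sketch of that pull-side stealing; the load array stands
in for per-CPU kseqs, and only queues with more than one runnable entry
are victims, since the running thread itself cannot be stolen:

    #include <stdio.h>

    #define NCPU 4

    /* Returns the victim cpu, or -1 if every queue is too light. */
    static int
    steal_from(int self, int load[NCPU])
    {
            int i;

            for (i = 0; i < NCPU; i++) {
                    if (i == self || load[i] <= 1)
                            continue;
                    load[i]--;          /* take one runnable thread */
                    load[self]++;
                    return (i);
            }
            return (-1);
    }

    int
    main(void)
    {
            int load[NCPU] = { 0, 1, 3, 2 };

            printf("stole from cpu %d\n", steal_from(0, load)); /* cpu 2 */
            return (0);
    }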


# 109864 26-Jan-2003 jeff

- Add the ule scheduler. This is intended to be a general purpose process
scheduler with many SMP benefits. It is still very experimental and should
be used only in test environments.