History log of /freebsd-9.3-release/sys/kern/sched_ule.c
Revision Date Author Comments
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 262057 17-Feb-2014 avg

MFC r258622,258675: dtrace sdt: remove the ugly sname parameter of
SDT_PROBE_DEFINE
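For illustration, a hedged sketch of what a probe definition looks like after
this change (macro arity and argument types are examples, not the exact
sched_ule.c lines):

    #include <sys/sdt.h>

    SDT_PROVIDER_DECLARE(sched);
    /* Before r258622 an extra string name ("change-pri") was passed;
     * now the probe name is derived from the C identifier, with "__"
     * translated to "-". */
    SDT_PROBE_DEFINE3(sched, , , change__pri, "struct thread *",
        "struct proc *", "uint8_t");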


# 259986 27-Dec-2013 dim

MFC r259875:

In sys/kern/sched_ule.c, remove static function sched_both(), which is
unused since r232207.


# 259835 24-Dec-2013 jhb

MFC 258869:
Fix an off-by-one error in r228960. The maximum priority delta provided
by SCHED_PRI_TICKS should be SCHED_PRI_RANGE - 1 so that the resulting
priority value (before nice adjustment) is between SCHED_PRI_MIN and
SCHED_PRI_MAX, inclusive.
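A minimal sketch of the corrected clamp, simplified from sched_priority()
(the surrounding computation is abbreviated):

    /* Clamp the ticks-derived delta to SCHED_PRI_RANGE - 1 so that
     * SCHED_PRI_MIN + delta <= SCHED_PRI_MAX (inclusive range). */
    pri = SCHED_PRI_MIN;
    if (ts->ts_ticks)
        pri += MIN(SCHED_PRI_TICKS(ts), SCHED_PRI_RANGE - 1);
    pri += SCHED_PRI_NICE(td->td_proc->p_nice);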


# 256208 09-Oct-2013 mav

MFC r255363:
Micro-optimize cpu_search(), allowing the compiler to use a more efficient
inline ffsl() implementation, when available, instead of homegrown iteration.

On a dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load this reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
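In outline, the iteration change looks like this (a sketch, not the literal
cpu_search() code):

    /* Before: probe every CPU index in the mask. */
    for (cpu = 0; cpu <= mp_maxid; cpu++)
        if (mask & (1UL << cpu))
            ;  /* examine cpu */

    /* After: ffsl() jumps straight to the lowest set bit (1-based). */
    while (mask != 0) {
        cpu = ffsl(mask) - 1;
        mask &= ~(1UL << cpu);
        /* examine cpu */
    }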


# 255569 14-Sep-2013 mav

Temporarily revert r255541 since there is no CPU_FFS in stable/9 yet. Sorry.


# 255541 14-Sep-2013 mav

MFC r255363:
Micro-optimize cpu_search(), allowing the compiler to use a more efficient
inline ffsl() implementation, when available, instead of homegrown iteration.

On a dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load this reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.


# 252070 21-Jun-2013 gnn

MFC: 249514
Point args[0] not at the thread that is ending but at the one that
is starting. This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR: 177706
Submitted by: Tiwei Bie


# 247150 22-Feb-2013 mav

MFC r242852, r243069:
Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there were no context switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If the current CPU was idle, it is quite unlikely that some
other CPU has load to steal. Under a high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on a 24-CPU system the load stealing code may consume
up to 25% of all CPU time without giving any benefit.
- Change the code that implements spinning for load to restart the spin in
case of a context switch. The previous code periodically called cpu_idle()
even under a high interrupt/context switch rate.
- Raise the spinning threshold to 10KHz, where it gives at least some effect
that may be worth the consumed power.
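A rough sketch of the first item, using the switch counters this change
introduced (field names as in that era's sched_ule.c; logic simplified):

    /* sched_idletd(): sum the current and previous idle-period switch
     * counts; if nothing ran here since we last went idle, a wakeup for
     * bus mastering or TLB shutdown should not trigger a steal scan. */
    switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
    if (switchcnt > 1)
        tdq_idled(tdq);    /* try to steal load from other CPUs */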


# 242544 04-Nov-2012 eadler

MFC r241844:
remove duplicate semicolons where possible.

Approved by: cperciva (implicit)


# 241250 06-Oct-2012 mav

MFC r239194:
Allow idle threads to steal second threads from other cores on systems with
8 or more cores to improve utilization. None of my tests on a 2xXeon (2x6x2)
system showed any slowdown from the mentioned "excess thrashing". At the same
time, in a pbzip2 test with more threads than CPUs I see up to 10% speedup
with SMT disabled and up to 5% with SMT enabled. Thinking about thrashing I
tried to limit the stealing to within the same last level cache, but got only
worse results. The present code in any case prefers to steal threads from
topologically closer cores.


# 241249 06-Oct-2012 mav

MFC r239185, r239196:
Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
- restore (4BSD) and implement (ULE) hogticks variable setting;
- make sched_rr_interval() more tolerant of options;
- restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
- tune some sysctl descriptions;
- make some style fixes.


# 241248 06-Oct-2012 mav

MFC r239157:
Rework the r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and the end of the time slice
happened at the same time, the thread was put at the head of the queue as for
preemption alone. That could cause a single thread to run for an indefinitely
long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of
preemption. But that causes a delayed context switch every time preemption
happens, even when not needed.

Solve the problem by introducing a scheduler-specific thread flag,
TDF_SLICEEND, set when the thread's time slice is over and it should be put
at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as was
done before, was simply not informative enough to work correctly.

In my tests this reduces run time deviation by 2-3 times (improving fairness)
in cases when several threads share one CPU.
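In outline, the flag separates "slice expired" from "preempted" (a sketch
close to, but not verbatim, the committed code):

    /* sched_clock(): the slice ran out - reschedule and record why. */
    if (--ts->ts_slice <= 0) {
        ts->ts_slice = sched_slice;
        td->td_flags |= TDF_NEEDRESCHED | TDF_SLICEEND;
    }

    /* sched_switch(): only a pure preemption earns head-of-queue
     * placement; an expired slice goes back to the tail. */
    preempted = (flags & SW_PREEMPT) &&
        (td->td_flags & TDF_SLICEEND) == 0;
    srqflag = preempted ?
        SRQ_OURSELF | SRQ_YIELDING | SRQ_PREEMPTED :
        SRQ_OURSELF | SRQ_YIELDING;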


# 240839 22-Sep-2012 avg

MFC r240513: sched_ule: fix inverted condition in reporting of priority
lending via ktr


# 236547 04-Jun-2012 mav

MFC r234066:
Microoptimize cpu_search().

According to profiling, it makes cpu_search() take 6% of CPU time on
hackbench, with its millions of context switches per second, instead of 8%
before.


# 236546 04-Jun-2012 mav

MFC r232740:
Make the kern.sched.idlespinthresh default value adaptive, depending on HZ.
Otherwise, with HZ above 8000, the CPU may never skip timer ticks when idle.


# 236344 30-May-2012 rstone

MFC r235459 and r235471

r235459:
Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
several probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added two probes that are not defined in Solaris, lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated

r235471:
Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.


# 234166 12-Apr-2012 mav

MFC r232917:
Rewrite thread CPU usage percentage math so it does not depend on periodic
calls with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but for now it just removes
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS, which exists in patches out of the tree, depends on it.


# 233814 02-Apr-2012 jhb

MFC 232700:
Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().


# 233599 28-Mar-2012 mav

MFC r232207, r232454:
Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking the previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle, to avoid
extra resource sharing.
- In sched_pickcpu() change the general logic of CPU selection. First
look for an idle CPU sharing the last level cache with the previously used
one, skipping SMT CPU groups. If none is found, search all CPUs for the
least loaded one where the thread with its priority can run now. If none is
found, search just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows differentiating 1+1 and 2+0 loads.
- Make cpu_search() prefer the specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection when the above factors are equal. The previous
code tended to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework the periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make the balancing process flat, removing recursion
over the topology tree. That fixes the double swap problem and makes load
distribution more even and predictable.

All together this gives a 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, when the number of threads is less than the
number of logical CPUs. In some tests it also gives a positive effect to
systems without SMT.

Sponsored by: iXsystems, Inc.


# 230691 28-Jan-2012 marius

MFC: r226057

- Currently, sched_balance_pair() may cause a CPU to send an IPI_PREEMPT to
itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 (MFC'ed to stable/9 in r230690) and the the above change in
place, sparc64 now no longer is incompatible with ULE and vice versa.
However, powerpc/E500 still is.

Submitted by: jhb
Reviewed by: jeff
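A minimal sketch of the resulting check (simplified; the real diff sits in
the balancer code):

    /* Moving load to ourselves needs no IPI: once this routine
     * returns, the current CPU will reschedule on its own. */
    if (cpu != PCPU_GET(cpuid))
        ipi_cpu(cpu, IPI_PREEMPT);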


# 230173 15-Jan-2012 avg

MFC r228718: ule: ensure that batch timeshare threads are scheduled
fairly


# 230083 13-Jan-2012 jhb

MFC 228960:
Cap the priority calculated from the current thread's running tick count
at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
happen due to irregularities with clock interrupts under certain
virtualization environments.


# 230079 13-Jan-2012 jhb

MFC 229429:
Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicking mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 225199 26-Aug-2011 delphij

Fix format strings for KTR_STATE in the 4BSD and ULE schedulers.

Submitted by: Ivan Klymenko <fidaj@ukr.net>
PR: kern/159904, kern/159905
MFC after: 2 weeks
Approved by: re (kib)


# 224221 19-Jul-2011 attilio

Remove explicit MAXCPU usage from sys/pcpu.h, avoiding namespace
pollution. That is a further step in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.

Reported by: jhb
Reviewed and tested by: pluknet
Approved by: re (kib)


# 222813 07-Jun-2011 attilio

Retire the cpumask_t type and replace it with cpuset_t usage.

This is intended to fix the bug where cpu mask objects are
capped at 32 bits. MAXCPU can now be arbitrarily bumped to whatever
value. Anyway, as long as several structures in the kernel are
statically allocated and sized as MAXCPU, it is suggested to keep it
as low as possible for the time being.

Technical notes on this commit itself:
- More functions to handle cpuset_t objects are introduced.
The most notable are cpusetobj_ffs() (which calculates a ffs(3)
for a cpuset_t object), cpusetobj_strprint() (which prepares a string
representing a cpuset_t object) and cpusetobj_strscan() (which
creates a valid cpuset_t starting from a string representation).
- pc_cpumask and pc_other_cpus are targeted for removal soon.
With the move from cpumask_t to cpuset_t they are now inefficient
and not really useful. Anyway, for the time being, please note that
access to pcpu data is protected by sched_pin() in order to avoid
migrating the CPU while reading more than one (possible) word.
- Please note that the size of cpuset_t objects may differ between kernel
and userland. While this is not directly related to the patch itself,
it is good to understand that concept and possibly use the patch
as a reference on how to deal with cpuset_t objects in userland, when
accessing kernel members.
- KTR_CPUMASK is changed and is now represented through a string, to be
set as in the example reported in NOTES.

Please additionally note that no MAXCPU is bumped in this patch, but
private testing has been done up to MAXCPU=128 on a real 8x8x2(htt)
machine (amd64).

Please note that the FreeBSD version is not yet bumped because of
the upcoming pcpu changes. However, note that this patch is not
targeted for MFC.

People to thank for the time spent on this patch:
- sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested
several revisions of the patches and really helped in improving the
stability of this work.
- marius fixed several bugs in the sparc64 implementation and reviewed
patches related to ktr.
- jeff and jhb discussed the basic approach followed.
- kib and marcel made targeted review on some specific part of the
patch.
- marius, art, nwhitehorn and andreast reviewed MD specific part of
the patch.
- marius, andreast, gonzo, nwhitehorn and jceel tested MD specific
implementations of the patch.
- Other people have made contributions on other patches that have been
already committed and have been listed separately.

Companies that should be mentioned for having participated to various
degrees:
- Yahoo! for having offered the machines used for testing on big
count of CPUs.
- The FreeBSD Foundation for having sponsored my devsummit attendance,
which has been instrumental.
- Sandvine for having offered offices and infrastructure during
development.

(I really hope I didn't forget anyone; if I did, I apologize in
advance.)


# 220198 31-Mar-2011 fabient

Clearing the flag when preempting will let the preempted thread run for
too much time. This can end in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP, or no SMP.

As requested by jhb@, 4BSD has received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by: jhb, jeff
MFC after: 1 month


# 217410 14-Jan-2011 jhb

Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
the timesharing priority band can be increased. The new timeshare range
is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
interactive threads. Instead, the larger timeshare range is now split
into separate subranges for interactive and non-interactive ("batch")
threads. The end result is that interactive threads and non-interactive
threads still use the same priority ranges as before, but realtime
threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
or via cv_broadcastpri(). Realtime and idle priority threads will
no longer have their priorities affected by sleeping in the kernel.

Reviewed by: jeff


# 217351 13-Jan-2011 jhb

Introduce two new helper macros to define the priority ranges used for
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE. No functional change.

Reviewed by: jeff


# 217291 11-Jan-2011 jhb

Always use PRI_BASE() when checking the base type of a thread's priority
class.

MFC after: 2 weeks


# 217237 10-Jan-2011 jhb

Fix two harmless off-by-one errors.

Reviewed by: jeff
MFC after: 2 weeks


# 217078 06-Jan-2011 jhb

- Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy of the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.

MFC after: 2 weeks


# 216791 29-Dec-2010 davidxu

- Following r216313, sched_unlend_user_prio is no longer needed; always
use sched_lend_user_prio to set the lent priority.
- Improve the pthread priority-inherit mutex: when a contender's priority
is lowered, re-propagate priorities; this may cause the mutex owner's
priority to be lowered too. In the old code, the mutex owner's priority
was raise-only.


# 216313 09-Dec-2010 davidxu

MFp4:
It is possible for a lower priority thread to lend priority to a higher
priority thread; in the old code this was ignored. However, the lending
should always be recorded. Add the field td_lend_user_pri to fix the
problem; if a thread does not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week


# 215240 13-Nov-2010 trasz

Remove unused variables.


# 215102 10-Nov-2010 attilio

Fix typos.

Submitted by: gianni
MFC after: 3 days


# 214611 31-Oct-2010 davidxu

Use an integer for the size of a cpuset, as it won't be bigger than
INT_MAX. This is requested by bge.
Also move the sysctl into kern_cpuset.c, because it should always be
there; it is independent of the thread scheduler.


# 214510 29-Oct-2010 davidxu

Add sysctl kern.sched.cpusetsize to export the size of the kernel cpuset;
also add the sysconf() key _SC_CPUSET_SIZE to get the sysctl value.

Submitted by: gcooper


# 212974 21-Sep-2010 jhb

Comment nit: set TDF_NEEDRESCHED after the comment describing why it is
done, rather than before.

MFC after: 1 week


# 212821 18-Sep-2010 avg

kern.sched.topology_spec sysctl: use a step of 1 for group level numbering.

This is just a cosmetic change for prettier output. The 'indent'
variable/parameter serves two purposes: it specifies the whitespace
indentation level and also implies the cpu group level/depth. It would
have been better to split those two uses, but for now it is just a
simple change.

MFC after: 1 week


# 212541 13-Sep-2010 mav

Refactor timer management code with priority given to one-shot operation
mode. The main goal of this is to generate timer interrupts only when there
is some work to do. When the CPU is busy, interrupts are generated at the
full rate of hz + stathz to fulfill scheduler and timekeeping requirements.
But when the CPU is idle, only the minimum set of interrupts (down to 8
interrupts per second per CPU now) needed to handle scheduled callouts is
executed. This allows a significant increase in idle CPU sleep time,
increasing the effect of static power-saving technologies. It should also
reduce host CPU load on virtualized systems when the guest system is idle.

There is a set of tunables, also available as writable sysctls, controlling
the desired event timer subsystem behavior:
kern.eventtimer.timer - chooses the event timer hardware to use. On x86
there are up to 4 different kinds of timers. Depending on whether the
chosen timer is per-CPU, the behavior of the other options differs slightly.
kern.eventtimer.periodic - chooses between periodic and one-shot operation
mode. In periodic mode, the current timer hardware is taken as the only
source of time for time events. This mode is quite similar to the previous
kernel behavior. One-shot mode instead uses the currently selected time
counter hardware to schedule all needed events one by one and programs the
timer to generate an interrupt at exactly the specified time. The default
value depends on the chosen timer's capabilities, but one-shot mode is
preferred unless another is forced by the user or hardware.
kern.eventtimer.singlemul - in periodic mode, specifies how many times
higher the timer frequency should be, so as not to strictly alias
hardclock() and statclock() events. Default values are 2 and 4, but they
could be reduced to 1 if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU receive every timer interrupt
independently of whether it is busy or not. By default this option is
disabled. If the chosen timer is per-CPU and runs in periodic mode, this
option has no effect - all interrupts are generated anyway.

Since this patch modifies cpu_idle() on some platforms, I have also
refactored it on x86. Now it makes use of the MONITOR/MWAIT instructions
(if supported) under high sleep/wakeup rates, as a fast alternative to
other methods. It allows the SMP scheduler to wake up sleeping CPUs much
faster without using IPIs, significantly increasing performance on some
highly task-switching loads.

Tested by: many (on i386, amd64, sparc64 and powerpc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.


# 212416 10-Sep-2010 mav

Do not IPI a CPU that is already spinning for load. It doubles the effect
of spinning (compared to MWAIT) on some heavily switching test loads.
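Sketch of the idea in tdq_notify() (tdq_cpu_idle stands for a flag set only
once the CPU has truly halted; treat the naming as illustrative):

    /* A CPU still spinning in its idle loop will notice the new load
     * by itself; only a halted (HLT/MWAIT) CPU needs the interrupt. */
    if (tdq->tdq_cpu_idle)
        ipi_cpu(cpu, IPI_PREEMPT);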


# 212153 02-Sep-2010 mdf

Fix UP build.

MFC after: 2 weeks


# 212115 01-Sep-2010 mdf

Fix a bug with sched_affinity() where it checks td_pinned of another
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU. Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.

KASSERT in sched_bind() that the thread is not yet pinned to a CPU.

KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.

sched_affinity code came from jhb@.

MFC after: 2 weeks


# 211515 19-Aug-2010 jhb

Remove unused KTRACE includes.


# 210939 06-Aug-2010 jhb

Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month
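The conversion is mechanical; for example:

    /* Before: build a one-CPU mask for the mask-based API. */
    ipi_selected(1 << cpu, IPI_PREEMPT);

    /* After: address the CPU directly by its cpuid. */
    ipi_cpu(cpu, IPI_PREEMPT);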


# 210117 15-Jul-2010 ivoras

A cosmetic change - don't output empty <flags>.


# 209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().
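Typical shape of the conversion (CPU_FOREACH skips absent CPU ids itself):

    /* Before: */
    for (i = 0; i <= mp_maxid; i++) {
        if (CPU_ABSENT(i))
            continue;
        /* ... per-CPU work ... */
    }

    /* After: */
    CPU_FOREACH(i) {
        /* ... per-CPU work ... */
    }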


# 208983 10-Jun-2010 ivoras

Unconfuse THREAD and SMT flags


# 208982 10-Jun-2010 ivoras

Cosmetic change to XML - less ugly newlines


# 208787 03-Jun-2010 jhb

Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it. All of the current callers already hold the
lock.

MFC after: 1 month


# 208391 21-May-2010 jhb

Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after: 1 month


# 208165 16-May-2010 rrs

This pushes all of JC's patches that I have in place. I
am now able to run 32 cores ok, but I will still hang
on buildworld with an NFS problem. I suspect I am missing
a patch for the netlogic rge driver.

JC, check and see if I am missing anything except your
core-mask changes.

Obtained from: JC


# 202889 23-Jan-2010 attilio

- Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
sched_lock was acquired (without the aid of the td_lock interface) and
the td_lock was dropped. This was going to break locking rules for other
threads wanting to access the thread (via the td_lock interface) and
modify its flags (allowed as long as the container lock was different
from the one used in sched_switch). In order to prevent this situation,
while sched_lock is acquired there the td_lock gets blocked. [0]
- Merge ULE's internal function thread_block_switch() into the global
thread_lock_block() and make the former's semantics the default for
thread_lock_block(). This means that thread_lock_block() will not
disable interrupts when called (and consequently thread_unlock_block()
will not re-enable them when called). This should be done manually
when necessary.
Note, however, that ULE's thread_unblock_switch() is not reaped,
because it does reflect a difference in semantics due to ULE (the
td_lock may not necessarily still be blocked_lock when calling it).
While asymmetric, it does describe a remarkable difference in semantics
that is good to keep in mind.

[0] Reported by: Kohji Okuno
<okuno dot kohji at jp dot panasonic dot com>
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC: 2 weeks


# 201347 31-Dec-2009 kib

Allow swap out of the kernel stack for a thread with priority greater
than or equal to PSOCK, not less than or equal. A higher priority has a
lesser numerical value.

The existing test does not allow swapout of a thread waiting for an
advisory lock, for an exiting child, or sleeping on a timeout. On the
other hand, high-priority waiters on VFS/VM events can be swapped out.

Tested by: pho
Reviewed by: jhb
MFC after: 1 week


# 201148 28-Dec-2009 ed

Don't forget to use `void' for sched_balance(). It has no arguments.


# 199764 24-Nov-2009 ivoras

Make ULE process usage (%CPU) accounting usable again by keeping track
of the last tick we incremented on.

Submitted by: matthew.fleming/at/isilon.com, is/at/rambler-co.ru
Reviewed by: jeff (who thinks there should be a better way in the future)
Approved by: gnn (mentor)
MFC after: 3 weeks


# 198854 03-Nov-2009 attilio

Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvement aims to avoid further cache misses in scheduler-specific
functions which need to keep track of average thread running time, and
further locking in places setting this flag.

Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>


# 198126 15-Oct-2009 jhb

Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by: Taku YAMAMOTO taku of tackymt.homeip.net
Reviewed by: jeff
MFC after: 3 days


# 197223 15-Sep-2009 attilio

Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are no longer shared even in the
HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
even with different locks, with the contribution of tdq_lock_pair().
An explanation is here:
(hypothesis: a thread needs to migrate to another CPU; thread1 is doing
sched_switch_migrate() and thread2 is the one handling the sched_switch()
request, or in other words, thread1 is the thread that needs to migrate
and thread2 is a thread that is going to be preempted, most likely an
idle thread. Also, 'old' refers to the context (in terms of
run-queue and CPU) thread1 is leaving and 'new' refers to the
context thread1 is going into. Finally, thread3 is doing tdq_idled()
or sched_balance() and definitively doing tdq_lock_pair())

* thread1 blocks its td_lock. Now td_lock is 'blocked'
* thread1 drops its old runqueue lock
* thread1 acquires the new runqueue lock
* thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
through tdq_notify() to the new CPU
* thread1 drops the new lock
* thread3, scanning the runqueues, locks the old lock
* thread2 receives the IPI_PREEMPT and does thread_lock() with td_lock
pointing to the new runqueue
* thread3 wants to acquire the new runqueue lock, but it can't because
it is held by thread2, so it spins
* thread1 wants to acquire the old lock, but as long as it is held by
thread3 it can't
* thread2, going further, at some point wants to switch in thread1,
but it will wait forever because thread1->td_lock is in the blocked state

This deadlock has manifested mostly on 7.x and has been reported several
times on mailing lists under the title 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and to Jeff for help on the comment wording.

Reviewed by: jeff
Reported by: des, others
Tested by: des, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
(STABLE_7 based version)


# 194779 23-Jun-2009 jeff

- Use cpuset_t and the CPU_ macros in place of cpumask_t so that ULE
supports arbitrary numbers of cpus rather than being limited by
cpumask_t to the number of bits in a long.
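Illustrative before/after of the mask handling (a sketch):

    /* Before: cpumask_t is a single long - at most 8*sizeof(long) CPUs. */
    cpumask_t mask = 0;
    mask |= 1 << cpu;
    if (mask & (1 << cpu))
        ;  /* cpu is in the set */

    /* After: cpuset_t is a fixed-size bit set dimensioned by MAXCPU. */
    cpuset_t cset;
    CPU_ZERO(&cset);
    CPU_SET(cpu, &cset);
    if (CPU_ISSET(cpu, &cset))
        ;  /* cpu is in the set */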


# 191676 29-Apr-2009 jeff

- Fix non-SMP build by encapsulating idle spin logic in a macro.

Pointy hat to: me


# 191645 29-Apr-2009 jeff

- Fix the FBSDID line.


# 191643 29-Apr-2009 jeff

- Remove the bogus idle thread state code. This may have a race in it
and it only optimized out an ipi or mwait in very few cases.
- Skip the adaptive idle code when running on SMT or HTT cores. This
just wastes cpu time that could be used on a busy thread on the same
core.
- Rename CG_FLAG_THREAD to CG_FLAG_SMT to be more descriptive. Re-use
CG_FLAG_THREAD to mean SMT or HTT.

Sponsored by: Nokia


# 189787 14-Mar-2009 jeff

- Fix an error that occurs when mp_ncpus is an odd number. steal_thresh
is calculated as 0, which causes errors elsewhere.

Submitted by: KOIE Hidetaka <koie@suri.co.jp>

- When sched_affinity() is called with a thread that is not curthread we
need to handle the ON_RUNQ() case by adding the thread to the correct
run queue.

Submitted by: Justin Teller <justin.teller@gmail.com>

MFC after: 1 Week


# 187679 25-Jan-2009 jeff

- Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
something more reasonable such as sizeof("32"). This shouldn't have
caused any ill effect until we run on machines with 1000000 or more
cpus.
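The distinction in a nutshell (standard two-level stringification, as in
sys/cdefs.h; sizes shown assume MAXCPU expands to 32):

    #define __STRING(x)  #x           /* stringify without expansion */
    #define __XSTRING(x) __STRING(x)  /* expand x first, then stringify */

    /* With #define MAXCPU 32:
     *   __STRING(MAXCPU)  -> "MAXCPU"  sizeof == 7
     *   __XSTRING(MAXCPU) -> "32"      sizeof == 3
     */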


# 187357 17-Jan-2009 jeff

- Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py. This allows developers to quickly
create a graphical view of ktr data for any resource in the system.
- Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
identifying records associated with a thread or cpu.
- Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from: attilio
Discussed with: jhb
Sponsored by: Nokia


# 186435 23-Dec-2008 ivoras

Add missing newlines to flags tags of CPU topology, for prettier
output.

Reviewed by: jeff (original version)
Approved by: gnn (mentor) (original version)


# 185047 18-Nov-2008 jhb

When checking to see if another CPU is running its idle thread, examine
the thread running on the other CPU instead of the thread being placed on
the run queue.

Reported by: Ravi Murty @ Intel
Reviewed by: jeff


# 184570 02-Nov-2008 ivoras

Increase the initial sbuf size for CPU topology dump to something more
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.

Approved by: gnn (mentor) (older version)
Pointed out by: rdivacky


# 184439 29-Oct-2008 ivoras

Introduce a new sysctl, kern.sched.topology_spec, that returns an XML
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.

An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <flags></flags>
  <children>
   <group level="2" cache-level="0">
    <cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
    <flags></flags>
   </group>
   <group level="2" cache-level="0">
    <cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
    <flags></flags>
   </group>
  </children>
 </group>
</groups>

Reviewed by: jeff
Approved by: gnn (mentor)


# 180607 19-Jul-2008 jeff

- Check whether we've recorded this tick in ts_ticks on another cpu in
sched_tick() to prevent multiple increments for one tick. This pushes
the value out of range and breaks priority calculation.

Reviewed by: kib
Found by: pho/nokia
Sponsored by: Nokia
MFC after: 3 days


# 179297 24-May-2008 jb

Add the vtime (virtual time) hooks for DTrace.


# 178471 25-Apr-2008 jeff

- Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle(), to wake up idle threads that are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.

Sponsored by: Nokia


# 178277 17-Apr-2008 jeff

- Add a metric to describe how busy a processor has been over the last
two ticks by counting the number of switches and the load when
sched_clock() is called.
- If the busy metric exceeds a threshold allow the idle thread to spin
waiting for new work for a brief period to avoid using IPIs. This
reduces the cost on the sender and receiver as well as reducing wakeup
latency considerably when it works.

Sponsored by: Nokia


# 178272 17-Apr-2008 jeff

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia
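Callers then pass the reason along with the switch flags; for instance, a
voluntary yield might look like this (illustrative use of the SW_/SWT_
constants):

    mi_switch(SW_VOL | SWT_RELINQUISH, NULL);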


# 178215 15-Apr-2008 marcel

Support and switch to the ULE scheduler:
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI,


# 177903 03-Apr-2008 jeff

- Allow static_boost to specify no boost with '0', traditional kernel
fixed pri boost with '1' or any priority less than the current thread's
priority with a value greater than two. Default the boost to
PRI_MIN_TIMESHARE to prevent regular user-space threads from starving
threads in the kernel. This prevents these user-threads from also
being scheduled as if they are high fixed-priority kernel threads.
- Restore the setting of lowpri in tdq_choose(). It has to be either here
or in sched_switch(). I accidentally removed it from both places.

Tested by: kris


# 177902 03-Apr-2008 jeff

- Don't check for the ITHD pri class in tdq_load_add and rem. 4BSD doesn't
do this either. Simply check P_NOLOAD. It'd be nice if this was
in a thread flag so we didn't have an extra cache miss every time we
add and remove a thread from the run-queue.


# 177435 20-Mar-2008 jeff

- Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h


# 177426 20-Mar-2008 jeff

- ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread(). The td_sched pointer is already
setup by thread_init anyway.


# 177376 19-Mar-2008 jeff

- Remove some dead code and comments related to KSE.
- Don't set tdq_lowpri on every switch, it should be precisely maintained now.
- Add some comments to sched_thread_priority().


# 177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177169 14-Mar-2008 jhb

Make the function prototype for cpu_search() match the declaration so that
this still compiles with gcc3.


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 177085 12-Mar-2008 jeff

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# 177042 10-Mar-2008 jeff

- Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
when we adjusted the priority. This involves the same number of
branches as before so should perform identically without the extra
fragility.

Tested by: bz
Reviewed by: bz


# 177011 10-Mar-2008 jeff

- Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0. Set it directly in sched_setup(). This fixes traps on boot
seen on some machines.

Reported by: phk


# 177009 10-Mar-2008 jeff

Reduce ULE context switch time by over 25%.

- Only calculate timeshare priorities once per tick or when a thread is woken
from sleeping.
- Keep the ts_runq pointer valid after all priority changes.
- Call tdq_runq_add() directly from sched_switch() without passing in via
tdq_add(). We don't need to adjust loads or runqs anymore.
- Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by: Nokia


# 177005 09-Mar-2008 jeff

- Add an implementation of sched_preempt() that avoids excessive IPIs.
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
so the logic is identical and not repeated between tdq_notify() and
sched_setpreempt().
- In tdq_notify() don't set NEEDRESCHED, as we may not actually own the thread
lock; this could have caused us to lose td_flags settings.
- Garbage collect some tunables that are no longer relevant.
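The shared decision logic, roughly (a condensed sketch of
sched_shouldpreempt(); the committed version has a few more cases):

    static inline int
    sched_shouldpreempt(int pri, int cpri, int remote)
    {
        if (pri >= cpri)             /* not better: never preempt */
            return (0);
        if (cpri >= PRI_MIN_IDLE)    /* always preempt the idle thread */
            return (1);
        if (preempt_thresh == 0)     /* preemption disabled */
            return (0);
        if (pri <= preempt_thresh)   /* important enough to preempt */
            return (1);
        return (0);
    }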


# 176735 02-Mar-2008 jeff

Add support for the new cpu topology api:
- When searching for affinity search backwards in the tree from the last
cpu we ran on while the thread still has affinity for the group. This
can take advantage of knowledge of shared L2 or L3 caches among a
group of cores.
- When searching for the least loaded cpu find the least loaded cpu via
the least loaded path through the tree. This load balances system bus
links, individual cache levels, and hyper-threaded/SMT cores.
- Make the periodic balancer recursively balance the highest and lowest
loaded cpu across each link.

Add support for cpusets:
- Convert the cpuset to a simple native cpumask_t while the kernel still
only supports cpumask.
- Pass the derived cpumask down through the cpu_search functions to
restrict the result cpus.
- Make the various steal functions resilient to failure since all threads
can not run on all cpus any longer.

General improvements:
- Precisely track the lowest priority thread on every runq with
tdq_setlowpri(). Before it was more advisory but this ended up having
pathological behaviors.
- Remove many #ifdef SMP conditions to simplify the code.
- Get rid of the old cumbersome tdq_group. This is more naturally
expressed via the cpu_group tree.

Sponsored by: Nokia
Testing by: kris


# 176734 02-Mar-2008 jeff

- Remove the old smp cpu topology specification with a new, more flexible
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a separate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.

Sponsored by: Nokia


# 176729 02-Mar-2008 jeff

- Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
- Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by: Nokia


# 175587 23-Jan-2008 jeff

- sched_prio() should only adjust tdq_lowpri if the thread is running or on
a run-queue. If the priority is numerically raised only change lowpri
if we're certain it will be correct. Some slop is allowed; however,
previously we could erroneously raise lowpri for an idle cpu that a
thread had recently run on, which led to errors in load balancing
decisions.


# 175348 15-Jan-2008 jeff

- When executing the 'tryself' branch in sched_pickcpu() look at the
lowest priority on the queue for the current cpu vs curthread's
priority. In the case that curthread is waking up many threads of a
lower priority as would happen with a turnstile_broadcast() or wakeup()
of many threads this prevents them from all ending up on the current cpu.
- In sched_add() make the relationship between a scheduled ithread and
the current cpu advisory rather than strict. Only give the ithread
affinity for the current cpu if it's actually being scheduled from
a hardware interrupt. This prevents it from migrating when it simply
blocks on a lock.

Sponsored by: Nokia


# 175104 05-Jan-2008 jeff

- Restore timeslicing code for all but SCHED_FIFO priority classes.

Reported by: Peter Jeremy <peterjeremy@optushome.com.au>


# 174847 21-Dec-2007 wkoszek

Make SCHED_ULE buildable with gcc3.

Reviewed by: cognet (mentor), jeffr
Approved by: cognet (mentor), jeffr


# 174629 15-Dec-2007 jeff

- Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a separate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.

In collaboration with: Kip Macy
Sponsored by: Nokia


# 174536 11-Dec-2007 davidxu

Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days


# 173600 14-Nov-2007 julian

Generally we are interested in what thread did something, as
opposed to what process. Since threads by default have the name of the
process unless overwritten with more useful information, just print the
thread name instead.


# 172887 22-Oct-2007 grehan

Cut over to ULE on PowerPC

kern/sched_ule.c - Add __powerpc__ to the list of supported architectures

powerpc/conf/GENERIC - Swap SCHED_4BSD with SCHED_ULE

powerpc/powerpc/genassym.c - Export TD_LOCK field of thread struct

powerpc/powerpc/swtch.S - Handle new 3rd parameter to cpu_switch() by
updating the old thread's lock. Note: uniprocessor-only, will require
modification for MP support.

powerpc/powerpc/vm_machdep.c - Set 3rd param of cpu_switch to mutex of
old thread's lock, making the call a no-op.

Reviewed by: marcel, jeffr (slightly older version)


# 172709 16-Oct-2007 sam

ULE works fine on arm; allow it to be used

Reviewed by: jeff, cognet, imp
MFC after: 1 week


# 172484 08-Oct-2007 jeff

- Bail out of tdq_idled if !smp_started or idle stealing is disabled. This
fixes a bug on UP machines with SMP kernels where the idle thread
constantly switches after trying to steal work from the local cpu.
- Make the idle stealing code more robust against self selection.
- Prefer to steal from the cpu with the highest load that has at least one
transferable thread. Before we selected the cpu with the highest
transferable count which excludes bound threads.

Collaborated with: csjp
Approved by: re


# 172483 08-Oct-2007 jeff

- Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching. This allows
threads of equal priority to run but not lesser priority.

Discussed with: davidxu
Reported by: NIIMI Satoshi <sa2c@sa2c.net>
Approved by: re


# 172411 01-Oct-2007 jeff

- Reassign the thread queue lock to newtd prior to switching. Assigning
after the switch leads to a race where the outgoing thread still owns
the local queue lock while another cpu may switch it in. This race
is only possible on machines where cpu_switch can take significantly
longer on different cpus which in practice means HTT machines with
unfair thread scheduling algorithms.

Found by: kris (of course)
Approved by: re


# 172409 01-Oct-2007 jeff

- Move the rebalancer back into hardclock to prevent potential softclock
starvation caused by unbalanced interrupt loads.
- Change the rebalancer to work on stathz ticks but retain randomization.
- Simplify locking in tdq_idled() to use the tdq_lock_pair() rather than
complex sequences of locks to avoid deadlock.

Reported by: kris
Approved by: re


# 172345 27-Sep-2007 jeff

- Honor the PREEMPTION and FULL_PREEMPTION flags by setting the default
value for kern.sched.preempt_thresh appropriately. It can still be
adjusted at runtime. ULE will still use IPI_PREEMPT in certain
migration situations.
- Assert that we're not trying to compile ULE on an unsupported
architecture. To date, I believe only i386 and amd64 have implemented
the third cpu switch argument required.

Approved by: re


# 172308 23-Sep-2007 jeff

- Bound the interactivity score so that it cannot become negative.

Approved by: re


# 172293 22-Sep-2007 jeff

- Improve grammar. s/it's/its/.
- Improve the long-term load balancer by always IPIing exactly once.
Previously the delay after rebalancing could cause problems with
uneven workloads.
- Allow nice to have a linear effect on the interactivity score. This
allows negatively niced programs to stay interactive longer. It may be
useful with very expensive Xorg servers under high loads. In general
it should not be necessary to alter the nice level to improve interactive
response. We may also want to consider never allowing positively niced
processes to become interactive at all.
- Initialize ccpu to 0 rather than 0.0. The decimal point was leftover
from when the code was copied from 4bsd. ccpu is 0 in ULE because ULE
only exports weighted cpu values.

Reported by: Steve Kargl (Load balancing problem)
Approved by: re


# 172264 21-Sep-2007 jeff

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re
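For example, a seconds-based computation turns into tick subtraction
(sketch):

    /* Old: td_slptime counted whole seconds directly.         */
    /* New: stamp the value of 'ticks' when the thread sleeps  */
    td->td_slptick = ticks;
    /* ... and derive seconds on demand: */
    slept_sec = (ticks - td->td_slptick) / hz;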


# 172207 17-Sep-2007 jeff

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 171899 20-Aug-2007 jeff

- Set steal_thresh to log2(ncpus). This improves idle-time load balancing
on 2cpu machines by reducing it to 1 by default. This improves loaded
operation on 8cpu machines by increasing it to 3 where the extra idle
time is not as critical.

Approved by: re


# 171715 03-Aug-2007 jeff

- Fix one line that erroneously crept in my last commit.

Approved by: re


# 171713 03-Aug-2007 jeff

- Share scheduler locks between hyper-threaded cores to protect the
tdq_group structure. Hyper-threaded cores won't really benefit from
separate locks anyway.
- Separate out the migration case from sched_switch to simplify the main
switch code. We only migrate here if called via sched_bind().
- When preempted place the preempted thread back in the same queue at
the head.
- Improve the cpu group and topology infrastructure.

Tested by: many on current@
Approved by: re


# 171506 19-Jul-2007 jeff

- Refine the load balancer to improve buildkernel times on dual core
machines.
- Leave the long-term load balancer running by default once per second.
- Enable stealing load from the idle thread only when the remote processor
has more than two transferable tasks. Setting this to one further
improves buildworld. Setting it higher improves mysql.
- Remove the bogus pick_zero option. I had not intended to commit this.
- Entirely disallow migration for threads with SRQ_YIELDING set. This
balances out the extra migration allowed for with the load balancers.
It also makes pick_pri perform better as I had anticipated.

Tested by: Dmitry Morozovsky <marck@rinet.ru>
Approved by: re


# 171505 19-Jul-2007 jeff

- When newtd is specified to sched_switch() it was not being initialized
properly. We have to temporarily unlock the TDQ lock so we can lock
the thread and add it to the run queue. This is used only for KSE.
- When we add a thread from the tdq_move() via sched_balance() we need to
ipi the target if it's sitting in the idle thread or it'll never run.

Reported by: Rene Landan
Approved by: re


# 171482 17-Jul-2007 jeff

ULE 3.0: Fine grain scheduler locking and affinity improvements. This has
been in development for over 6 months as SCHED_SMP.
- Implement one spin lock per thread-queue. Threads assigned to a
run-queue point to this lock via td_lock.
- Improve the facility for assigning threads to CPUs now that sched_lock
contention no longer dominates scheduling decisions on larger SMP
machines.
- Re-write idle time stealing in an attempt to make it less damaging to
general performance. This is still disabled by default. See
kern.sched.steal_idle.
- Call the long-term load balancer from a callout rather than sched_clock()
so there are no locks held. This is disabled by default. See
kern.sched.balance.
- Parameterize many scheduling decisions via sysctls. Try to document
these via sysctl descriptions.
- General structural and naming cleanups.
- Document each function with comments.

Tested by: current@ amd64, x86, UP, SMP.
Approved by: re


# 170787 15-Jun-2007 jeff

- Fix an off by one error in sched_pri_range.
- In tdq_choose() only assert that a thread does not have too high a
priority (low value) for the queue we removed it from. This will catch
bugs in priority elevation. It's not a serious error for the thread
to have too low a priority as we don't change queues in this case as
an optimization.

Reported by: kris


# 170600 12-Jun-2007 jeff

- Move some common code out of sched_fork_exit() and back into fork_exit().


# 170358 06-Jun-2007 jeff

- Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by: tegge


# 170317 05-Jun-2007 jeff

- Better fix for the previous error; use __DEVOLATILE() on the td_lock
pointer, as it can actually sometimes be something other than sched_lock,
even on schedulers which rely on a global scheduler lock.

Tested by: kan


# 170316 05-Jun-2007 jeff

- Pass &sched_lock as the third argument to cpu_switch() as this will
always be the correct lock and we don't get volatile warnings this
way.

Pointed out by: kan


# 170315 05-Jun-2007 jeff

- Define TDQ_ID() for the !SMP case.
- Default pick_pri to off. It is not faster in most cases.


# 170293 04-Jun-2007 jeff

Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
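The core trick, roughly (a sketch of the retry loop behind thread_lock();
the committed code adds spin accounting and critical sections):

    /* td_lock is a pointer that another CPU may retarget while we
     * spin; keep locking whatever it currently points at until the
     * pointer is stable under the lock we hold. */
    for (;;) {
        struct mtx *lock = td->td_lock;
        mtx_lock_spin(lock);
        if (lock == td->td_lock)
            break;              /* won the race: lock is stable */
        mtx_unlock_spin(lock);  /* lock moved under us; retry */
    }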


# 168891 20-Apr-2007 kmacy

Schedule the ithread on the same cpu as the interrupt

Tested by: kmacy
Submitted by: jeffr


# 167669 17-Mar-2007 jeff

- Handle the case where slptime == runtime.

Submitted by: Antoine Brodin


# 167664 17-Mar-2007 jeff

- Cast the intermediate value in the priority computation back down to
unsigned char. Weirdly, casting the 1 constant to u_char still produces
a signed integer result that is then used in the % computation. This
avoids that mess altogether and causes a 0 pri to turn into 255 % 64,
as we expect.

Reported by: kkenn (about 4 times, thanks)


# 167327 08-Mar-2007 julian

Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permanent and never changes for the life of the
system so it doesn't need locking.


# 167012 26-Feb-2007 kmacy

general LOCK_PROFILING cleanup

- only collect timestamps when a lock is contested - this reduces the overhead
of collecting profiles from 20x to 5x

- remove unused function from subr_lock.c

- generalize cnt_hold and cnt_lock statistics to be kept for all locks

- NOTE: rwlock profiling generates invalid statistics (and most likely always has)
someone familiar with that should review


# 166557 07-Feb-2007 jeff

- Change types for recent runq additions to u_char rather than int.
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by: kkenn using pho's stress2.


# 166247 25-Jan-2007 jeff

- Implement much more intelligent ipi sending. This algorithm tries to
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads. An IPI is sent when any of the
following holds:
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
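
A hedged standalone sketch of that three-way test; the function name, the
threshold variable, and the numeric priority bounds are illustrative
assumptions, not the committed code (lower numeric value means better
priority):

    #define PRI_MAX_REALTIME    79      /* assumed realtime/timeshare line */
    #define PRI_MIN_IDLE        224     /* assumed start of idle class */

    static int preempt_thresh = 64;     /* assumed tunable threshold */

    static int
    should_ipi(int pri, int cpri)       /* new vs. currently running */
    {
        if (cpri >= PRI_MIN_IDLE)       /* 1) an idle thread is running */
            return (1);
        if (pri <= PRI_MAX_REALTIME && cpri > PRI_MAX_REALTIME)
            return (1);                 /* 2) crossing the realtime line */
        if (pri < preempt_thresh)       /* 3) under the threshold */
            return (1);
        return (0);
    }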


# 166229 25-Jan-2007 jeff

- Get rid of the unused DIDRUN flag. This was really only present to
support sched_4bsd.
- Rename the KTR level for non schedgraph parsed events. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted its slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.


# 166208 24-Jan-2007 jeff

- With a sleep time over 2097 seconds hzticks and slptime could end up
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.
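
The arithmetic behind that threshold, assuming hz = 1000 and a 10-bit tick
scaling factor (both assumptions for illustration): 2^31 / (1000 * 1024) is
roughly 2097 seconds, so one more second of scaled sleep time overflows a
signed 32-bit counter. A standalone sketch:

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* 2098 s of sleep at hz = 1000, scaled by << 10 (assumed). */
        long long scaled = 2098LL * 1000 << 10;     /* 2,148,352,000 */

        printf("scaled: %lld, INT_MAX: %d\n", scaled, INT_MAX);
        /* Wraps negative in a 32-bit signed counter (two's complement). */
        printf("signed:   %d\n", (int)scaled);
        /* Stays ordered in a 32-bit unsigned counter. */
        printf("unsigned: %u\n", (unsigned)scaled);
        return (0);
    }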


# 166190 23-Jan-2007 jeff

- Catch up to setrunqueue/choosethread/etc. api changes.
- Define our own maybe_preempt() as sched_preempt(). We want to be able
to preempt idlethread in all cases.
- Define our idlethread to require preemption to exit.
- Get the cpu estimation tick from sched_tick() so we don't have to worry
about errors from a sampling interval that differs from the time
domain. This was the source of sched_priority prints/panics and
inaccurate pctcpu display in top.


# 166156 20-Jan-2007 jeff

- Disable the long-term load balancer. I believe that steal_busy works
better and gives more predictable results.


# 166152 20-Jan-2007 jeff

- We do need to IPI the idlethread on some systems. It may be stuck in
a power saving mode otherwise.
- If the thread is already bound in sched_bind() unbind it before
re-binding it to a new cpu. I don't like these semantics but they are
expected by some code in the tree. Patch by jkoshy.


# 166137 20-Jan-2007 jeff

- In tdq_transfer() always set NEEDRESCHED when necessary regardless of
the ipi settings. If NEEDRESCHED is set and an ipi is later delivered
it will clear it rather than cause extra context switches. However, if
we miss setting it we can have terrible latency.
- In sched_bind() correctly implement bind. Also be slightly more
tolerant of code which calls bind multiple times. However, we don't
change binding if another call is made with a different cpu. This
does not presently work with hwpmc which I believe should be changed.


# 166108 19-Jan-2007 jeff

Major revamp of ULE's cpu load balancing:
- Switch back to direct modification of remote CPU run queues. This added
a lot of complexity with questionable gain. It's easy enough to
reimplement if it's shown to help on huge machines.
- Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
sched_add() so we have selectable cpu choosers and simplify the logic
a bit here.
- Implement tdq_pickpri() as the new default cpu chooser. This algorithm
is similar to Solaris in that it tries to always run the threads with
the best priorities. It is actually slightly more complex than
solaris's algorithm because we also tend to favor the local cpu over
other cpus which has a boost in latency but also potentially enables
cache sharing between the waking thread and the woken thread.
- Add a bunch of tunables that can be used to measure effects of different
load balancing strategies. Most of these will go away once the
algorithm is more definite.
- Add a new mechanism to steal threads from busy cpus when we idle. This
is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The
threshold is the required length of a tdq's run queue before another
cpu will be able to steal runnable threads. This prevents most queue
imbalances that contribute to the long latencies.


# 165830 06-Jan-2007 jeff

- Don't let SCHED_TICK_TOTAL() return less than hz. This can cause integer
divide faults in roundup() later if it is able to return 0. For some
reason this bug only shows up on my laptop and not my testboxes.
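
roundup() divides by its second argument, so any divisor derived from
SCHED_TICK_TOTAL() faults if that total can reach 0. A hedged sketch of the
macro shapes involved (the roundup() definition is FreeBSD's <sys/param.h>
one; the clamp and the field names are assumptions):

    /* From <sys/param.h>: the division faults when y == 0. */
    #define roundup(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))

    #define max(a, b)       (((a) > (b)) ? (a) : (b))

    /*
     * Assumed sketch of the fix: clamp the sampled tick window to at
     * least hz so divisors derived from it can never be zero.
     */
    #define SCHED_TICK_TOTAL(ts) \
            (max((ts)->ts_ltick - (ts)->ts_ftick, hz))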


# 165827 06-Jan-2007 jeff

- Fix the sched_priority() invalid priority bugs. Use roundup() instead
of max() when computing the divisor in SCHED_TICK_PRI(). This prevents
cases where rounding down would allow the quotient to exceed
SCHED_PRI_RANGE.
- Garbage collect some unused flags and fields.
- Replace TDF_HOLD with sched_pin_td()/sched_unpin_td() since it simply
duplicated this functionality.
- Re-enable the rebalancer by default and fix the sysctl so it can be
modified.


# 165821 06-Jan-2007 jeff

- Don't IPI unless we're going to interrupt something exiting in the kernel;
otherwise we can afford the latency. This makes a significant performance
improvement.


# 165819 05-Jan-2007 jeff

- Fix a comparison in sched_choose() that caused cpus to be constantly
marked idle, thus breaking cpu load balancing.
- Change sched_interact_update() to fix cases where the stored history
has expanded significantly rather than handling them in the callers. This
fixes a case where sched_priority() could compute a bad value.
- Add a sysctl to disable the global load balancer for experimentation.


# 165796 05-Jan-2007 jeff

- ftick was initialized to -1 for init and any of its children. Fix this by
setting ftick = ltick = ticks in schedinit().
- Update the priority when we are pulled off of the run queue and when we
are inserted onto the run queue so that it more accurately reflects our
present status. This is important for efficient priority propagation
functioning.
- Move the frequency test into sched_pctcpu_update() so we don't repeat it
each time we'd like to call it.
- Put some temporary work-around code in sched_priority() in case the tick
mechanism produces a bad priority. Eventually this should revert to an
assert again.


# 165766 04-Jan-2007 jeff

- Only allow the tdq_idx to increase by one each tick rather than up to
the most recently chosen index. This significantly improves nice
behavior. This allows a lower priority thread to run some multiple of
times before the higher priority thread makes it to the front of
the queue. A nice +20 cpu hog now only gets ~5% of the cpu when running
with a nice 0 cpu hog and about 1.5% with a nice -20 hog. A nice
difference of 1 makes a 4% difference in cpu usage between two hogs.
- Track a separate insert and removal index. When the removal index is
empty it is updated to point at the current insert index.
- Don't remove and re-add a thread to the runq when it is being adjusted
down in priority.
- Pull some conditional code out of sched_tick(). It's looking a bit
large now.


# 165762 04-Jan-2007 jeff

ULE 2.0:
- Remove the double queue mechanism for timeshare threads. It was slow
due to excess cache lines in play, caused suboptimal scheduling behavior
with niced and other non-interactive processes, complicated priority
lending, etc.
- Use a circular queue with a floating starting index for timeshare threads.
Enforces fairness by moving the insertion point closer to threads with
worse priorities over time (see the sketch after this entry).
- Give interactive timeshare threads real-time user-space priorities and
place them on the realtime/ithd queue.
- Select non-interactive timeshare thread priorities based on their cpu
utilization over the last 10 seconds combined with the nice value. This
gives us more sane priorities and behavior in a loaded system as
compared to the old method of using the interactivity score. The
interactive score quickly hit a ceiling if threads were non-interactive
and penalized new hog threads.
- Use one slice size for all threads. The slice is not currently
dynamically set to adjust scheduling behavior of different threads.
- Add some new sysctls for scheduling parameters.

Bug fixes/Clean up:
- Fix zeroing of td_sched after initialization in sched_fork_thread() caused
by recent ksegrp removal.
- Fix KSE interactivity issues related to frequent forking and exiting of
kse threads. We simply disable the penalty for thread creation and exit
for kse threads.
- Cleanup the cpu estimator by using tickincr here as well. Keep ticks and
ltick/ftick in the same frequency. Previously ticks were stathz and
others were hz.
- Lots of new and updated comments.
- Many many others.

Tested on: up x86/amd64, 8way amd64.
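
A standalone sketch of the circular timeshare queue described above (the
queue count matches the kernel's RQ_NQS, but the variable and function
names are assumed):

    #define RQ_NQS  64          /* number of run queues */

    static int tdq_idx;         /* insertion head, advanced over time */
    static int tdq_ridx;        /* removal index, trailing the head */

    /*
     * Map a timeshare priority onto the circular queue relative to
     * the current head.  Threads with worse (numerically higher)
     * priorities land farther from the head, but as the head advances
     * they drift toward it and must eventually be chosen: fairness by
     * rotation rather than by recomputing slices.  Removal starts at
     * tdq_ridx and catches up to tdq_idx as queues empty.
     */
    static int
    timeshare_idx(int pri)
    {
        return ((tdq_idx + pri) % RQ_NQS);
    }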


# 165627 29-Dec-2006 jeff

- More search and replace prettying.


# 165620 29-Dec-2006 jeff

- Clean up a bit after the most recent KSE restructuring.


# 164939 06-Dec-2006 julian

Changes to try to fix sched_ule.c, courtesy of David Xu.


# 164936 06-Dec-2006 julian

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 164091 08-Nov-2006 maxim

o Fix a couple of obvious typos.


# 163709 26-Oct-2006 jb

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 161599 25-Aug-2006 davidxu

Add user priority loaning code to support priority propagation for
1:1 threading's POSIX priority mutexes; the code is a no-op unless
the priority-aware umtx code is committed.


# 159630 15-Jun-2006 davidxu

Add the scheduler API sched_relinquish(); it is used to implement the
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish the cpu; the ULE and CORE schedulers have two internal
run-queues, and a timesharing thread which calls the yield() syscall
should be moved to the inactive queue.


# 159570 13-Jun-2006 davidxu

Add scheduler CORE, the work I did half a year ago and recently
picked up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different from ULE's; it comes from the Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use the same word
"score" as a priority boost, as in the ULE scheduler.

Briefly, the scheduler has the following characteristics:
1. A timesharing process's nice value is seriously respected;
the timeslice and the interaction-detecting algorithm are based
on the nice value.
2. Per-cpu scheduling queues and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in the wakeup path.
5. Support for POSIX SCHED_FIFO and SCHED_RR.
Unlike the 4BSD and ULE schedulers, which use a fuzzy RQ_PPQ, this
scheduler uses 256 priority queues. Unlike ULE, which uses both pull
and push, this scheduler uses only the pull method; the main reason is
to let a relatively idle cpu do the work. But currently the whole
scheduler is protected by the big sched_lock, so the benefit is not
visible; it really can be worse than nothing, because all other cpus
are locked out while we are doing balancing work, a problem the 4BSD
scheduler does not have. The scheduler does not support hyperthreading
very well; in fact, it does not distinguish between physical and
logical CPUs, which should be improved in the future. The scheduler
has a priority inversion problem on MP machines, so it is not good for
realtime scheduling and can cause realtime process starvation.
Despite that, MySQL super-smack seems to run better on my
Pentium-D machine when using libthr, on both UP and SMP kernels.


# 159337 06-Jun-2006 davidxu

Make ke_rqindex unsigned.


# 153749 27-Dec-2005 davidxu

Use variable i instead of variable cpus as an index to get the correct kseq.


# 153533 19-Dec-2005 davidxu

Fix a bug in the slice calculation code: the current code uses hz, but
sched_clock() is driven by the stat clock.

Submitted by: taku at tackymt dot homeip dot net


# 150443 21-Sep-2005 davidxu

This is a forced commit. The previous change, revision 1.158, was intended
to fix PR kern/86067.


# 150442 21-Sep-2005 davidxu

Temporarily disable the nice threshold detection code, as it can starve
a thread holding a critical resource, e.g. a mutex or other implicit
synchronization flags. Give a thread which exceeds the nice threshold a
minimum time slice.

PR: kern/86087


# 149278 19-Aug-2005 davidxu

Move up the code for testing KEF_HOLD to avoid ke_cpu being changed unexpectedly
for PRI_ITHD and PRI_REALTIME threads.


# 148856 08-Aug-2005 davidxu

Try our best to keep a preempted thread at the front of the run queue;
this seems to improve performance a bit for some workloads, but we still
see interactive lag unless the cpu idling race is fixed.


# 148603 31-Jul-2005 davidxu

If a thread was removed from the system run queue, kse_assign shouldn't
add it again.


# 148383 25-Jul-2005 delphij

Cast to uintptr_t when the compiler complains. This fixes the ULE
scheduler breakage that accompanied the recent atomic_ptr() change.


# 147565 23-Jun-2005 peter

Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by: re


# 147068 07-Jun-2005 jeff

- Fix the case where we're not preempting but there is already a newtd
as this happens via thread_switchout(). I don't particularly like the
structure of the code here. We twice call out to thread code when
a thread is voluntarily switching. Once to thread_switchout() and once
to slot_fill(), while sched_4BSD does even more work which is redundant
to select another thread to use our remaining slice. This should be
simplified in the future, but for now I'm only going to fix the bug not
the bad design.


# 146955 04-Jun-2005 jeff

- It's 2005 already, I've been working on this for three years.


# 146954 04-Jun-2005 jeff

- Don't SLOT_USE() in the preempt case, sched_add() has already taken the
slot for us. Previously, we would take two slots on every preempt, and
setrunqueue() would fix it up for us in the non threaded case. The
threaded case was simply broken.
- Clean up flags, prototypes, comments.


# 145256 19-Apr-2005 jkoshy

Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by: alc, jhb (kernel changes)


# 144777 08-Apr-2005 ups

Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by: jhb


# 142273 22-Feb-2005 jeff

- A test in sched_switch() is no longer necessary and it is incorrect
when td0 is preempted before it voluntarily switches.

Discovered by: Arjan Van Leeuwen <avleeuwen@gmail.com>


# 141292 04-Feb-2005 jeff

- Add ke_runq == NULL to the conditions which will cause us to abort
adjusting timeshare loads in sched_class(). This is only important if
the thread has never run, otherwise the state checks should work as
expected.


# 139455 30-Dec-2004 jhb

Fix a typo and two whitespace nits.


# 139453 30-Dec-2004 jhb

Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides whether the thread
can go back to normal mode (if its normal priority is high enough to
satisfy the pending lock requests) or whether it should continue to use the
priority specified in the sched_unlend_prio() call (see the sketch after
this entry). This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another thread's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.

Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.

Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64
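
A hedged sketch of the unlend decision described above; the function names
are from this commit, but the body is simplified and the stub types are for
illustration only (lower numeric value means higher priority):

    struct thread {                     /* stub fields */
        unsigned char   td_base_pri;
        int             td_flags;
    };
    #define TDF_BORROWING   0x00000001  /* flag value assumed */

    void    sched_prio(struct thread *, unsigned char);
    void    sched_lend_prio(struct thread *, unsigned char);

    /*
     * 'prio' is the minimum priority the turnstile code says the
     * thread must keep to satisfy the threads still blocked on locks
     * it owns.
     */
    void
    sched_unlend_prio(struct thread *td, unsigned char prio)
    {
        unsigned char base_pri = td->td_base_pri;

        if (base_pri <= prio) {
            /* Normal priority already satisfies every waiter. */
            td->td_flags &= ~TDF_BORROWING;
            sched_prio(td, base_pri);
        } else {
            /* Keep borrowing, at the minimum required priority. */
            sched_lend_prio(td, prio);
        }
    }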


# 139336 26-Dec-2004 jeff

- Unintentionally checked in a debugging panic. Remove that.


# 139334 26-Dec-2004 jeff

- Fix a long standing problem where an ithread would not honor sched_pin().
- Remove the sched_add wrapper that used sched_add_internal() as a backend.
Its only purpose was to interpret one flag and turn it into an int. Do
the right thing and interpret the flag in sched_add() instead.
- Pass the flag argument to sched_add() to kseq_runq_add() so that we can
get the SRQ_PREEMPT optimization too.
- Add a KEF_INTERNAL flag. If KEF_INTERNAL is set we don't adjust the SLOT
counts, otherwise the slot counts are adjusted as soon as we enter
sched_add() or sched_rem() rather than when the thread is actually placed
on the run queue. This greatly simplifies the handling of slots.
- Remove the explicit prevention of migration for ithreads on non-x86
platforms. This was never shown to have any real benefit.
- Remove the unused class argument to KSE_CAN_MIGRATE().
- Add ktr points for thread migration events.
- Fix a long standing bug on platforms which don't initialize the cpu
topology. The ksg_maxid variable was never correctly set on these
platforms which caused the long term load balancer to never inspect
more than the first group or processor.
- Fix another bug which prevented the long term load balancer from working
properly. If stathz != hz we can't expect sched_clock() to be called
on the exact tick count that we're anticipating.
- Rearrange sched_switch() a bit to reduce indentation levels.


# 139316 25-Dec-2004 jeff

- Remove earlier KTR_ULE tracepoints.
- Define new KTR_SCHED points so that we can graph the operation of the
scheduler.


# 138843 14-Dec-2004 jeff

- Garbage collect several unused members of struct kse and struct ksegrp.
As best as I can tell, some of these were never used.


# 138842 14-Dec-2004 jeff

- In kseq_choose(), don't recalculate slice values for processes with a
nice of 0. Doing so can cause an infinite loop because they should be
running, but a nice -20 process could prevent them from doing so.
- Add a new flag KEF_PRIOELEV to flag a thread that has had its priority
elevated due to priority propagation. If a thread has had its priority
elevated, we assume that it must go on the current queue and it must
get a slice.
- In sched_userret() if our priority was elevated and we shouldn't have
a timeslice, yield here until we should.

Found/Tested by: glebius


# 138802 13-Dec-2004 jeff

- Take up a 'slot' while we're on the assigned queue, waiting to be
posted to another processor. Otherwise, kern_switch() gets confused
and tries to sched_add(NULL).


# 137588 11-Nov-2004 jeff

- Temporarily disable the nice -20 throttling code. It has some interaction
with APM that I do not understand yet.

Reported & Tested by: glebius


# 137067 30-Oct-2004 jeff

- When choosing a thread on the run queue, check to see if its nice is
outside of the nice threshold due to a recently awoken thread with a
lower nice value. This further reduces the amount of time a positively
niced thread gets while running in conjunction with a workload that has
many short sleeps (ie buildworld).


# 137061 30-Oct-2004 jeff

- In sched_prio() check to see if the kse is assigned to a runq as the
check for TD_ON_RUNQ() no longer means the thread is really on a run-
queue. I suspect this state should be re-evaluated as it must mean
something else now. This fixes ULE+KSE+PREEMPTION on UP x86.


# 136173 05-Oct-2004 julian

Fix whitespace botch that only showed up in the commit message diff :-/

MFC after: 4 days


# 136170 05-Oct-2004 julian

When preempting a thread, put it back on the HEAD of its run queue.
(Only really implemented in 4bsd)

MFC after: 4 days


# 136169 05-Oct-2004 julian

Oops. left out part of the diff.

MFC after: 4 days


# 136167 05-Oct-2004 julian

Use some macros to track available scheduler slots to allow
easier debugging.

MFC after: 4 days


# 135295 16-Sep-2004 julian

clean up thread runq accounting a bit.

MFC after: 3 days


# 135076 11-Sep-2004 scottl

Revert the previous round of changes to td_pinned. The scheduler isn't
fully initialized when the pmap layer tries to call sched_pin() early in
boot, which results in a quick panic. Use ke_pinned instead, as was originally
done with Tor's patch.

Approved by: julian


# 135059 10-Sep-2004 julian

Try committing from the right tree this time
MFC after: 2 days


# 135056 10-Sep-2004 julian

Make up my mind if cpu pinning is stored in the thread structure or the
scheduler specific extension to it. Put it in the extension as
the implementation details of how the pinning is done needn't be visible
outside the scheduler.

Submitted by: tegge (of course!) (with changes)
MFC after: 3 days


# 135051 10-Sep-2004 julian

Add some code to allow threads to nominate a sibling to run if they are going to sleep.

MFC after: 1 week


# 134791 05-Sep-2004 julian

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time, so that a ksegrp with
10000 threads can not swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually
develop its own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
libkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 134649 02-Sep-2004 scottl

Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c). This
is a possible MT5 candidate.


# 134586 01-Sep-2004 julian

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# 134415 27-Aug-2004 peter

Commit Jeff's suggested changes for avoiding a bug that is exposed by
preemption and/or the rev 1.79 kern_switch.c change that was backed out.

The thread was being assigned to a runq without adding in the load, which
would cause the counter to hit -1.


# 133555 12-Aug-2004 jeff

- Introduce a new flag KEF_HOLD that prevents sched_add() from doing a
migration. Use this in sched_prio() and sched_switch() to stop us from
migrating threads that are in short term sleeps or are runnable. These
extra migrations were added in the patches to support KSE.
- Only set NEEDRESCHED if the thread we're adding in sched_add() is a
lower priority and is being placed on the current queue.
- Fix some minor whitespace problems.


# 133427 10-Aug-2004 jeff

- Use a new flag, KEF_XFERABLE, to record with certainty that this kse had
contributed to the transferable load count. This prevents any potential
problems with sched_pin() being used around calls to setrunqueue().
- Change the sched_add() load balancing algorithm to try to migrate on
wakeup. This attempts to place threads that communicate with each other
on the same CPU.
- Don't clear the idle counts in kseq_transfer(), let the cpus do that when
they call sched_add() from kseq_assign().
- Correct a few out of date comments.
- Make sure the ke_cpu field is correct when we preempt.
- Call kseq_assign() from sched_clock() to catch any assignments that were
done without IPI. Presently all assignments are done with an IPI, but I'm
trying a patch that limits that.
- Don't migrate a thread if it is still runnable in sched_add(). Previously,
this could only happen for KSE threads, but due to changes to
sched_switch() all threads went through this path.
- Remove some code that was added with preemption but is not necessary.


# 132776 28-Jul-2004 kan

Avoid casts as lvalues.


# 132589 23-Jul-2004 scottl

Clean up whitespace, increase consistency and correctness.

Submitted by: bde


# 132372 18-Jul-2004 julian

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theoretically this would lead to runtime being allocated to the wrong
entity in some cases, though it is not clear how often this actually happened.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 132266 16-Jul-2004 jhb

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# 131929 10-Jul-2004 marcel

Update for the KDB framework:
o Call kdb_backtrace() instead of backtrace().


# 131839 08-Jul-2004 jhb

- Move contents of sched_add() into a sched_add_internal() function that
takes an argument to specify if it should preempt or not. Don't preempt
when sched_add_internal() is called from kseq_idled() or kseq_assign()
as in those cases we are about to call mi_switch() anyways. Also, doing
so during the first context switch on an AP leads to a NULL pointer deref
because curthread is NULL.
- Reenable preemption for ULE.

Submitted by: Taku YAMAMOTO taku at tackymt.homeip.net


# 131678 06-Jul-2004 rwatson

Temporarily disable preemption in SCHED_ULE due to reported panics and
hangs due to recent preemption changes. This change appears to remove
the panic that I was running into, but at the cost of increasing
ithread scheduling latency, and as such is a temporary band-aid until
jhb has a chance to resolve the ule<->preemption interaction that is
the source of the problem. If it doesn't fix the problem for others--
sorry!


# 131527 03-Jul-2004 phk

Add NULL arg to mi_switch() call to stop kernel compiles from breaking.


# 131510 02-Jul-2004 bmilekic

Fix SCHED_ULE build on SMP. The previous revision (1.110)
introduced a KSE_CAN_MIGRATE() invocation with one argument
missing (class). Either this is a genuine oversight or it crept
in from JHB's repo where he may have modified it. If it's
the latter then it may require more attention. For now fix
the make depend.


# 131481 02-Jul-2004 jhb

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption
(see the sketch after this entry).
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)
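
A hedged standalone sketch of the maybe_preempt() logic described in the
first item above; the stub types, flag values, and SW_INVOL value are
illustrative assumptions (lower numeric priority is better):

    struct thread {
        int     td_priority;    /* lower value: higher priority */
        int     td_critnest;    /* critical section nesting depth */
        int     td_flags;
    };
    #define TDF_OWEPREEMPT  0x00000002  /* flag value assumed */
    #define SW_INVOL        2           /* value assumed */

    extern struct thread *curthread;
    void    mi_switch(int flags, struct thread *newtd);

    static int
    maybe_preempt(struct thread *td)
    {
        struct thread *ctd = curthread;

        if (td->td_priority >= ctd->td_priority)
            return (0);         /* not better: just queue it */
        if (ctd->td_critnest > 1) {
            /* Nested critical section: defer the preemption. */
            ctd->td_flags |= TDF_OWEPREEMPT;
            return (0);
        }
        mi_switch(SW_INVOL, td);        /* switch directly, skip the runq */
        return (1);
    }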


# 131473 02-Jul-2004 jhb

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to switch to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# 130881 21-Jun-2004 scottl

Add the sysctl node 'kern.sched.name' that has the name of the scheduler
currently in use. Move the 4bsd kern.quantum node to kern.sched.quantum
for consistency.


# 130551 15-Jun-2004 julian

Nice is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 129982 02-Jun-2004 jeff

- Run sched_balance() and sched_balance_groups() from hardclock via
sched_clock() rather than using callouts. This means we no longer have to
take the load of the callout thread into consideration while balancing and
should make the balancing decisions simpler and more accurate.

Tested on: x86/UP, amd64/SMP


# 128563 22-Apr-2004 obrien

There was a thread on "unusually high load averages" when running under
sched_ule, in January 2004. Looking at this, "pagezero" is (one of) the
culprit(s). We had no provision for processes with P_NOLOAD set. With
pagezero not running at PRI_ITHD, kseq_load_{add,rem} count pagezero as
another-normal-process, thus the "expected-plus-one" load reported in
the above thread.

Submitted by: Nikos Ntarmos <ntarmos@ceid.upatras.gr>


# 128055 09-Apr-2004 cognet

Spell "switches" a more conventional way.


# 127850 04-Apr-2004 jeff

- Use the proper constant in sched_interact_update(). Previously,
SCHED_INTERACT_MAX was used where SCHED_SLP_RUN_MAX was needed. This was
causing the interactivity scaler to lose history at a more dramatic rate
than intended.


# 127498 27-Mar-2004 marcel

Change the type of the various CPU masks to cpumask_t. Note that as
long as there are still explicit uses of int, whether in types or
in function names (such as atomic_set_int() in sched_ule.c), we can
not change cpumask_t to be anything other than u_int. See also the
commit log for sys/sys/types.h, revision 1.84.


# 127278 21-Mar-2004 obrien

Give a more reasonable CPU time to the threads which are using scheduler
activation (i.e., applications are using libpthread). This is because
SCHED_ULE sometimes puts P_SA processes into ksq_next unnecessarily.
Which doesn't give fair amount of CPU time to processes which are
using scheduler-activation-based threads when other (semi-)CPU-intensive,
non-P_SA processes are running.

Further work will no doubt be done by jeffr at a later date.

Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Reviewed by: rwatson, freebsd-current@


# 126326 27-Feb-2004 jhb

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep queues in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel (see the sketch
after this entry).
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleeps no longer inherently bump the priority. Instead, this is solely
a property of msleep(), which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
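
A standalone sketch of that shape (the table size, hash function, and stub
thread type are assumptions; buckets chain queue heads, and each head lists
the threads asleep on one wait channel):

    #include <sys/queue.h>
    #include <stdint.h>

    #define SC_TABLESIZE    128         /* size assumed */

    struct thread_stub {
        TAILQ_ENTRY(thread_stub) td_slpq;
    };

    struct sleepqueue {
        LIST_ENTRY(sleepqueue)    sq_hash;      /* bucket chain */
        TAILQ_HEAD(, thread_stub) sq_blocked;   /* sleeping threads */
        const void               *sq_wchan;     /* wait channel */
    };

    LIST_HEAD(sq_chain, sleepqueue) sc_chains[SC_TABLESIZE];

    #define SC_HASH(wc)     ((((uintptr_t)(wc)) >> 8) % SC_TABLESIZE)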


# 125299 01-Feb-2004 jeff

- Allow interactive tasks to use the maximum time-slice. This is not as
detrimental as I thought it would be in the case of massive process
storms from a shell and it makes regular desktop usage noticeably
better.


# 125289 01-Feb-2004 jeff

- Add a new member to struct kseq called ksq_sysload. This is intended to
track the load for the sched_load() function. In the SMP case this member
is not defined because it would be redundant with the ksg_load member
which already tracks the non ithd load.
- For sched_load() in the UP case simply return ksq_sysload. In the SMP
case traverse the list of kseq groups and sum up their ksg_load fields.


# 124959 25-Jan-2004 jeff

- sched_strict has been dead for a long time now. Get rid of it.


# 124958 25-Jan-2004 jeff

- Clean up KASSERTS.


# 124944 25-Jan-2004 jeff

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and properly
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and
remove direct references to the process statistics.


# 123694 20-Dec-2003 jeff

- Make our transfer decisions based on load and not transferable load. A
cpu could have been bogged down with non-transferable load and still not
migrated a new thread to an idle cpu. This required some benchmarking and
tuning to get right as the comment above it suggests.


# 123693 20-Dec-2003 jeff

- Enable ithread migration on x86. This is done to work around a bug in the
IO APIC on Xeons that prevents round-robin interrupt assignment from
working.


# 123685 20-Dec-2003 jeff

- In kseq_transfer() return if smp has not been started.
- In sched_add(), do the idle check prior to the transfer check so that we
don't try to transfer load from an idle cpu. This fixes panics caused by
IPIs on UP machines running SMP kernels.

Reported/Debugged by: seanc


# 123684 20-Dec-2003 jeff

- Running interactive tasks with the minimum time-slice is fine for vi and
sh, but not so great for mozilla, X, etc. Add a fixed define for the slice
size granted to interactive KSEs.


# 123529 14-Dec-2003 jeff

- Assign the ke_cpu field in kseq_notify() so that all of our callers do not
have to do it.
- Set the ke_runq to NULL in sched_add() before calling kseq_notify().
Otherwise we may panic in sched_add() if INVARIANTS is on.


# 123487 12-Dec-2003 jeff

- Now that we have kseq groups, balance them separately.
- The new sched_balance_groups() function does intra-group balancing while
sched_balance() balances the available groups.
- Pick a random time between 0 ticks and hz * 2 ticks to restart each
balancing process. Each balancer has its own timeout.
- Pick a random place in the list of groups to start the search for lowest
and highest group loads. This prevents us from preferring a group based on
numeric position.
- Use a nasty hack to stop us from preferring cpu 0. The problem is that
softclock always runs on cpu 0, so it always has a little extra load. We
ignore this load in the balancer for now. In the future softclock should
run on a random cpu and these hacks can go away.


# 123435 11-Dec-2003 jeff

- Don't let the pctcpu rate limiter throttle us if we have recorded over
SCHED_CPU_TICKS ticks. This was allowing processes to display
(1/SCHED_CPU_TIME * 100) % more cpu than they had used.


# 123434 11-Dec-2003 jeff

- In sched_switch(), if a thread has been assigned, don't touch the runqueues
or load. These things have already been taken care of in sched_bind()
which should be the only place that we're switching in an assigned thread.


# 123433 11-Dec-2003 jeff

- Add support for CPU groups to ule. All SMT cores on the same physical
cpu are added to a group.
- Don't place a cpu into the kseq_idle bitmask until all cpus in that group
have idled.
- Prefer idle groups over idle group members in the new kseq_transfer()
function. In this way we will prefer to balance load across full cores
rather than add further load to a partial core.
- Before a cpu goes idle, check the other group members for threads. Since
SMT cpus may freely share threads, this is cheap.
- SMT cores may be individually pinned and bound to now. This contrasts the
old mechanism where binding or pinning would have allowed a thread to run
on any available cpu.
- Remove some unnecessary logic from sched_switch(). Priority propagation
should be properly taken care of in sched_prio() now.


# 123231 07-Dec-2003 peter

rqb_bits[] may be an int64_t (eg: on alpha, and recently on amd64).
Be sure to shift (long)1 << 33 and higher, not (int)1. Otherwise bad
things happen(TM). This is why beast.freebsd.org paniced with ULE.

Reviewed by: jeff
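
A minimal standalone illustration:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        int64_t bits;

        /* bits = 1 << 33; -- undefined: the shift happens in 32-bit int */
        bits = (long)1 << 33;   /* widen first: 8589934592 on LP64 */
        printf("%jd\n", (intmax_t)bits);
        return (0);
    }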


# 123126 03-Dec-2003 jhb

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha


# 122848 17-Nov-2003 jeff

- Mark ksq_assigned as volatile so that when this code is used without
sched_lock we can be sure that we'll pick up the new value.


# 122847 17-Nov-2003 jeff

- Remove long dead code. rslices hasn't been used in some time and neither
has sched_pickcpu().


# 122744 15-Nov-2003 jeff

- Introduce kseq_runq_{add,rem}() which are used to insert and remove
kses from the run queues. Also, on SMP, we track the transferable
count here. Threads are transferable only as long as they are on the
run queue.
- Previously, we adjusted our load balancing based on the transferable count
minus the number of actual cpus. This was done to account for the threads
which were likely to be running. All of this logic is simpler now that
transferable accounts for only those threads which can actually be taken.
Updated various places in sched_add() and kseq_balance() to account for
this.
- Rename kseq_{add,rem} to kseq_load_{add,rem} to reflect what they're
really doing. The load is accounted for separately from the runq because
the load is accounted for even as the thread is running.
- Fix a bug in sched_class() where we weren't properly using the PRI_BASE()
version of the kg_pri_class.
- Add a large comment that describes the impact of a seemingly simple
conditional in sched_add().
- Also in sched_add() check the transferable count and KSE_CAN_MIGRATE()
prior to checking kseq_idle. This reduces the frequency of access for
kseq_idle which is a shared resource.


# 122165 06-Nov-2003 jeff

- Somehow I botched my last commit. Add an extra ( to fix things up. I'm
still not sure how this happened.

Reported by: ps


# 122158 06-Nov-2003 jeff

- Remove the local definition of sched_pin and unpin. They are provided in
sched.h now.
- Respect the td pin count.


# 122094 05-Nov-2003 jeff

- It's ok if sched_runnable() has races in it, we don't need the sched_lock
here unless we have something on the assigned queue.


# 122038 04-Nov-2003 jeff

- Add initial support for pinning and binding.


# 121923 03-Nov-2003 jeff

- Remove kseq_find(), we no longer scan other cpu's run queues when we go
idle. They figure out that we're idle fast enough that the cache pollution
introduced by scanning their run queue is more expensive than waiting
a little longer.
- Add kseq_setidle() to mark us as being idle. Use this in place of
kseq_find().
- Remove kseq_load_highest(), kseq_find() was the only consumer of this
interface. kseq_balance() has its own customized version that finds the
lowest and highest loads simultaneously.

Continuously told that this would be faster by: terry


# 121896 02-Nov-2003 jeff

- Remove the ksq_loads[] array. We are only interested in three counts,
the total load, the timeshare load, and the number of threads that can
be migrated to another cpu. Account for these separately.
- Introduce a KSE_CAN_MIGRATE() macro which determines whether or not a KSE
can be migrated to another CPU. Currently, this only checks to see if
we're an interrupt handler. Eventually this will also be used to support
CPU binding.


# 121872 02-Nov-2003 jeff

- In sched_prio() only force us onto the current queue if our priority is
being elevated (numerically smaller).


# 121871 02-Nov-2003 jeff

- Rename SCHED_PRI_NTHRESH to SCHED_SLICE_NTHRESH since it is only used in
slice assignment. Add a comment describing what it does.
- Remove a stale XXX comment; nice should not impact the interactivity, as
nice adjustments only affect non-interactive tasks in ULE.
- Don't allow nice -20 tasks to totally starve nice 0 tasks. Give them at
least SCHED_SLICE_MIN ticks. We still allow nice 0 tasks to starve nice
+20 tasks as intended.


# 121869 02-Nov-2003 jeff

- Remove uses of PRIO_TOTAL and replace them with SCHED_PRI_NRESV
- SCHED_PRI_NRESV does not have the off by one error in PRIO_TOTAL so we
do not have to account for it in the few places that we use it.

Requested by: bde


# 121868 02-Nov-2003 jeff

- Change sched_interact_update() to only accept slp+runtime values between
0 and SCHED_SLP_RUN_MAX * 2. This allows us to simplify the algorithm
quite a bit. Before, it dealt with arbitrary values which required us
to do nasty integer division tricks that didn't quite work out correctly.
- Change sched_wakeup() to detect conditions where the slp+runtime could
exceed SCHED_SLP_RUN_MAX * 2. This can happen if we go to sleep for
longer than 6 seconds. In this case, we'll just clear the runtime and
set the sleep time to the max.
- Define a new function, sched_interact_fork(), which updates the slp+runtime
of a newly forked thread (sketched after this entry). We want to limit the
amount of history retained
from the parent so that we learn the child's behavior quickly. We don't,
however want to decay it to nothing. Previously, we would simply divide
each parameter by 100 whenever we forked. After a few forks the values
would reach 0 and tasks would not be considered interactive.
- Add another KTR entry, cleanup some existing entries.
- Remove a useless sched_interact_update() from sched_priority(). This is
already done by the callers that require it.
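
A standalone sketch of the sched_interact_fork() decay described above (the
cap constant and its value are assumptions): instead of dividing by 100,
scale both values down by the same ratio, which preserves their
proportion -- and hence the interactivity score -- while shrinking the
inherited history.

    #define SLP_RUN_FORK    (2 * 1000000)       /* cap, value assumed */

    static void
    interact_fork(int *runtime, int *slptime)
    {
        int ratio, sum;

        sum = *runtime + *slptime;
        if (sum > SLP_RUN_FORK) {
            ratio = sum / SLP_RUN_FORK;
            *runtime /= ratio;
            *slptime /= ratio;
        }
    }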


# 121790 31-Oct-2003 jeff

- Add static to local functions and data where it was missing.
- Add an IPI based mechanism for migrating kses. This mechanism is
broken down into several components. This is intended to reduce cache
thrashing by eliminating most cases where one cpu touches another's
run queues.
- kseq_notify() appends a kse to a lockless singly linked list and
conditionally sends an IPI to the target processor (see the sketch after
this entry). Right now this is
protected by sched_lock but at some point I'd like to get rid of the
global lock. This is why I used something more complicated than a
standard queue.
- kseq_assign() processes our list of kses that have been assigned to us
by other processors. This simply calls sched_add() for each item on the
list after clearing the new KEF_ASSIGNED flag. This flag is used to
indicate that we have been appended to the assigned queue but not
added to the run queue yet.
- In sched_add(), instead of adding a KSE to another processor's queue we
use kse_notify() so that we don't touch their queue. Also in sched_add(),
if KEF_ASSIGNED is already set return immediately. This can happen if
a thread is removed and readded so that the priority is recorded properly.
- In sched_rem() return immediately if KEF_ASSIGNED is set. All callers
immediately readd simply to adjust priorites etc.
- In sched_choose(), if we're running an IDLE task or the per cpu idle thread
set our cpumask bit in 'kseq_idle' so that other processors may know that
we are idle. Before this, make a single pass through the run queues of
other processors so that we may find work more immediately if it is
available.
- In sched_runnable(), don't scan each processor's run queue, they will IPI
us if they have work for us to do.
- In sched_add(), if we're adding a thread that can be migrated and we have
plenty of work to do, try to migrate the thread to an idle kseq.
- Simplify the logic in sched_prio() and take the KEF_ASSIGNED flag into
consideration.
- No longer use kseq_choose() to steal threads, it can lose its last
argument.
- Create a new function runq_steal() which operates like runq_choose() but
skips threads based on some criteria. Currently it will not steal
PRI_ITHD threads. In the future this will be used for CPU binding.
- Create a kseq_steal() that checks each run queue with runq_steal(), use
kseq_steal() in the places where we used kseq_choose() to steal with
before.
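
A standalone C11 sketch of the lockless assign list described above (the
kernel used its own atomic primitives under sched_lock; the names here are
assumed):

    #include <stdatomic.h>
    #include <stddef.h>

    struct kse_stub {
        struct kse_stub *ke_assign;     /* next element in the list */
    };

    static _Atomic(struct kse_stub *) ksq_assigned;

    /* kseq_notify() side: push; returns 1 if the list was empty. */
    static int
    assign_push(struct kse_stub *ke)
    {
        struct kse_stub *head = atomic_load(&ksq_assigned);

        do {
            ke->ke_assign = head;
        } while (!atomic_compare_exchange_weak(&ksq_assigned, &head, ke));
        return (head == NULL);          /* first entry: send the IPI */
    }

    /* kseq_assign() side: detach the whole list at once, then walk it. */
    static struct kse_stub *
    assign_takeall(void)
    {
        return (atomic_exchange(&ksq_assigned, NULL));
    }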


# 121682 29-Oct-2003 bde

Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion must match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.


# 121625 28-Oct-2003 jeff

- Only change the run queue in sched_prio() if the kse is non null. threads
can be in the TD_ON_RUNQ state and not have an associated kse.
- Remove the PRI_IDLE special case from sched_clock(), it was not actually
necessary.


# 121605 27-Oct-2003 jeff

- Use a better algorithm in sched_pctcpu_update()

Contributed by: Thomaswuerfl@gmx.de

- In sched_prio(), adjust the run queue for threads which may need to move
to the current queue due to priority propagation.
- In sched_switch(), fix style bug introduced when the KSE support went in.
Columns are 80 chars wide, not 90.
- In sched_switch(), fix the comparison in the idle case and explicitly
re-initialize the runq in the not propagated case.
- Remove dead code in sched_clock().
- In sched_clock(), if we're an IDLE class td, set NEEDRESCHED so that threads
that have become runnable will get a chance to.
- In sched_runnable(), if we're not the IDLETD, we should not consider
curthread when examining the load. This mimics the 4BSD behavior of
returning 0 when the only runnable thread is running.
- In sched_userret(), remove the code for setting NEEDRESCHED entirely.
This is not necessary and is not implemented in 4BSD.
- Use the correct comparison in sched_add() when checking to see if an idle
prio task has had its priority temporarily elevated.


# 121290 20-Oct-2003 jeff

- If a thread is not bound to a kse return 0 from sched_pctcpu().

Reported by: pawel.worach@nordea.com


# 121146 16-Oct-2003 jeff

- Only kse_reassign() in the !running case.

Reported by: kris


# 121145 16-Oct-2003 jeff

- Call sched_add() with the correct argument on SMP.

Reported by: Valentin Chopov <valentin@valcho.net>


# 121132 16-Oct-2003 jeff

- Fix a minor problem with my last commit, we don't want to return from
sched_switch if the thread is running, we want to fall through and pick
a new thread because we have been preempted.


# 121128 16-Oct-2003 jeff

- Collapse sched_switchin() and sched_switchout() into sched_switch(). Now
mi_switch() calls sched_switch() which calls cpu_switch(). This is
actually one less function call than it had been.


# 121127 16-Oct-2003 jeff

- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.


# 121126 16-Oct-2003 jeff

- The non-iterative algorithm for interact_update was broken due to
rounding errors. This was the source of the majority of the
interactivity problems. Reintroduce the old algorithm and its XXX.
- Up the interactivity threshold to 30. It really could stand to be even
a tiny bit higher.
- Let the sleep and run time accumulate up to 5 seconds of history rather
than two. This helps stop XFree86 from becoming non-interactive during
bursts of activity.


# 121107 15-Oct-2003 jeff

- If our user_pri doesn't match our actual priority, our priority has been
elevated either due to priority propagation or because we're in the
kernel. In either case, put us on the current queue so that we don't
stop others from using important resources. At some point the priority
elevations from sleeping in the kernel should go away.
- Remove an optimization in sched_userret(). Before we would only set
NEEDRESCHED if there was something of a higher priority available. This
is a trivial optimization and it breaks priority propagation because it
doesn't take threads which we may be blocking into account. Notice that
the thread which is blocking others gets up to one tick of cpu time before
we honor this NEEDRESCHED in sched_clock().


# 121051 12-Oct-2003 jeff

- In SCHED_CURR() add holding Giant to the list of criteria that will keep
you on the current queue. In the future, it would be nice if priority
propagation could deterministically pluck a thread off of the next queue
and put it on the current queue. Until then this hack stops us from
holding up our entire current queue, including interrupt handlers, while
a thread on the next queue is blocked while holding Giant.
- Inherit our pctcpu information from our parent.


# 120754 04-Oct-2003 jeff

- Change a lame iterative algorithm to a constant time algorithm. Remove
the XXX that complains about it as well.

Submitted by: ThomasWuerfl@gmx.de


# 120272 20-Sep-2003 jeff

- Somewhere along the line I stupidly removed critical logic from
sched_pctcpu_update(). This caused erroneous cpu times in TOP for
processes that were asleep. Replace the code that was removed.


# 119488 26-Aug-2003 davidxu

Let SA processes work under the ULE scheduler; originally they would panic the kernel.

Reviewed by: jeff


# 119137 19-Aug-2003 sam

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


# 117326 08-Jul-2003 jeff

- When stealing a kse in kseq_move() ignore the current kseq's min nice
value. We want to steal any thread, even one that is not given a slice
on its current queue.


# 117313 07-Jul-2003 jeff

- Clean up an unused variable.

Submitted by: Steve Kargl <sgk@troutmask.apl.washington.edu>


# 117237 04-Jul-2003 jeff

- Parse the cpu topology map in sched_setup().
- Associate logical CPUs on the same physical core with the same kseq.
- Adjust code that assumed there would only be one running thread in any
kseq.
- Wrap the HTT code with a ULE_HTT_EXPERIMENTAL ifdef. This is a start
towards HyperThreading support but it isn't quite there yet.


# 116970 28-Jun-2003 jeff

- Don't migrate to stopped cpus.


# 116962 28-Jun-2003 jeff

- If smp is not started yet don't try to load balance or we'll put threads
on cpus that aren't running yet.


# 116955 28-Jun-2003 jeff

- Throttle the inherited sleep and run time in sched_fork_kseg(). This
allows us to learn the behavior of a thread much more quickly after it
starts up.


# 116946 28-Jun-2003 jeff

- Adjust the default maximum slice value to ~140ms. This has improved the
nice distribution without significantly impacting interactive response.
As a side effect it should also allow batch processes to run for a
slightly longer period which will positively impact their performance.


# 116643 21-Jun-2003 jeff

- lticks was erroneously being updated in sched_pctcpu(). This was causing
us to skip the pctcpu_update() call, which led to inaccurate cpu usage
statistics for processes that didn't run often.


# 116642 21-Jun-2003 jeff

- Don't allow nice to have such a large effect on priority. This was
causing poor interactive performance while unnice processes were running.
The new scheme still allows nice to have an effect on priority but it is
not as dramatic as the effect of the interactivity score.


# 116500 17-Jun-2003 jeff

- Use a more robust mechanism for determining whether or not a kse is on a
kseq.


# 116479 17-Jun-2003 jeff

- Temporarily patch a problem where the interact score could be negative
because the run time exceeds the largest value a signed int can hold.
The real solution involves calculating how far we are over the limit.
To quickly solve this problem we loop removing 1/5th of the current value
until it falls below the limit (sketched below). The common case requires
no passes.
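
A minimal standalone C sketch of the 1/5th reduction described above; the
function name and the limit constant are illustrative stand-ins, not the
kernel's:

    #include <stdio.h>

    /* Hypothetical cap standing in for the real overflow limit. */
    #define RUN_LIMIT (1UL << 30)

    /*
     * Repeatedly shave 1/5th off the accumulated run time until it
     * fits below the limit.  The common case takes zero passes.
     */
    static unsigned long
    throttle_runtime(unsigned long runtime)
    {
            while (runtime > RUN_LIMIT)
                    runtime -= runtime / 5;
            return (runtime);
    }

    int
    main(void)
    {
            printf("%lu\n", throttle_runtime(1UL << 31)); /* a few passes */
            printf("%lu\n", throttle_runtime(1000));      /* untouched */
            return (0);
    }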


# 116463 17-Jun-2003 jeff

- Add a new function "sched_interact_update()" that scales back the sleep
and run time (sketched below).
- Scale the sleep and run time back via sched_interact_update() in more
places. This is to keep the statistic more accurate.
- Charge a parent one tick for forking a child.
- Add only the run time and not the sleep time to the parent's kg when a
thread exits. This allows us to give a penalty for having an expensive
thread exit but does not give a bonus for having an interactive thread
exit.
- Change the SLP_RUN_THROTTLE to limit us to 4/5th and not 1/2.
- Change the SLP_RUN_MAX to two seconds. This keeps bursty interactive
applications like mozilla and openoffice in the interactive range even
through expensive tasks.
- Recalculate the slice after every sleep. This ensures that once a task
has been marked interactive it only has a slice of 1 at the risk of
giving tasks that sleep for a very brief period a longer time slice.
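
A standalone C sketch of the scale-back this commit describes: once the
combined history exceeds SLP_RUN_MAX (two seconds, per the notes above),
both values are throttled to 4/5ths, preserving the sleep:run ratio while
old behavior is slowly forgotten. The struct, field names, and the
microsecond unit are illustrative:

    #include <stdio.h>

    /* Two seconds of history, in microseconds (illustrative unit). */
    #define SLP_RUN_MAX (2UL * 1000000)

    struct interact_hist {
            unsigned long slptime;          /* accumulated sleep time */
            unsigned long runtime;          /* accumulated run time */
    };

    static void
    interact_update(struct interact_hist *h)
    {
            /* Throttle both values to 4/5ths until under the cap. */
            while (h->slptime + h->runtime > SLP_RUN_MAX) {
                    h->slptime = h->slptime * 4 / 5;
                    h->runtime = h->runtime * 4 / 5;
            }
    }

    int
    main(void)
    {
            struct interact_hist h = { 3000000, 1500000 };

            interact_update(&h);
            /* Ratio is still 2:1, total is now under two seconds. */
            printf("slp %lu run %lu\n", h.slptime, h.runtime);
            return (0);
    }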


# 116377 15-Jun-2003 jeff

- Increase the ksegrp's cpu time history buffer to 250ms.
- Decrease the history buffer divisor to 2 so that we remember more of the
old behavior.


# 116368 15-Jun-2003 jeff

- Cap the growth of sleep and run time in sched_exit_kse().


# 116365 15-Jun-2003 jeff

- Fix the maximum slice value. I accidentally checked in a value of '2'
which meant no process would run for longer than 20ms.
- Slightly redo the interactivity scorer. It follows the same algorithm but
in a slightly more correct way. Previously values above half were
incorrect.
- Lower the interactivity threshold to 20. It seems that in testing,
non-interactive tasks are hardly ever near it and expensive interactive
tasks can sometimes surpass it. This area needs more testing.
- Remove an unnecessary KTR.
- Fix a case where an idle thread that had an elevated priority due to
priority prop. would be placed back on the idle queue.
- Delay setting NEEDRESCHED until userret() for threads that had their
priority elevated while in kernel. This gives us the same context switch
optimization as SCHED_4BSD.
- Limit the child's slice to 1 in sched_fork_kse() so we detect its behavior
more quickly.
- Inherit some of the run/slp time from the child in sched_exit_ksegrp().
- Redo some of the priority comparisons so they are more clear.
- Throttle the frequency of sched_pctcpu_update() so that rounding errors
do not make it invalid.


# 116361 14-Jun-2003 davidxu

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 116069 08-Jun-2003 jeff

- Add a simple CPU load balancing algorithm. This works by executing once a
second and equalizing the load between the two most imbalanced CPUs (see
the sketch below). This is intended to clear up long term load imbalances
that would not be handled by the 'pull' method in sched_choose().
- Pull out some bits of sched_choose() into a kseq_move() function that moves
an arbitrary thread from one kseq to another.
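
A standalone C sketch of that push balancer; the load array stands in for
the kernel's per-CPU kseqs, and the decrement/increment stands in for the
kseq_move() call mentioned above:

    #include <stdio.h>

    #define NCPU 4

    /*
     * Find the most and least loaded queues and move one thread from
     * the former to the latter, as the once-a-second balancer does.
     */
    static void
    balance(int load[NCPU])
    {
            int i, high = 0, low = 0;

            for (i = 1; i < NCPU; i++) {
                    if (load[i] > load[high])
                            high = i;
                    if (load[i] < load[low])
                            low = i;
            }
            if (load[high] - load[low] > 1) {       /* worth a move */
                    load[high]--;                   /* kseq_move() stand-in */
                    load[low]++;
            }
    }

    int
    main(void)
    {
            int load[NCPU] = { 5, 1, 2, 2 };

            balance(load);
            printf("%d %d %d %d\n", load[0], load[1], load[2], load[3]);
            return (0);
    }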


# 115998 07-Jun-2003 jeff

- When a new thread is added to a kseq the load is incremented prior to
adding it to the nice tables. Therefore, in kseq_add_nice, we should
keep in mind that the load will be 1 if we are the only thread, and not
0.
- Assert that the sched lock is held in all the appropriate places.
- Increase the scope of the sched lock in sched_pctcpu_update().
- Hold the sched lock in sched_runnable(). It is not held by the caller.


# 114496 02-May-2003 julian

Fix typo in last commit


# 114471 01-May-2003 julian

Move the flag that indicates an idle thread from the KSE to the thread.
It was always referenced via the thread anyhow.

Reviewed by: jhb (a LOOOOONG time ago)


# 113923 23-Apr-2003 jhb

Add lock assertions for various proc/thread/kse/ksegroup fields to the
scheduler functions.


# 113873 22-Apr-2003 jhb

- Assert that the proc lock and sched_lock are held in sched_nice().
- For the 4BSD scheduler, this means that all callers of the static
function resetpriority() now always hold sched_lock, so don't lock
sched_lock explicitly in that function.


# 113865 22-Apr-2003 jhb

Protect p_swtime with the sched_lock.


# 113660 18-Apr-2003 jeff

- Set the ke_cpu field in sched_add() for interrupt and realtime threads
since they are going on the current cpu and not their previously assigned
cpu.
- sched_runnable() should only return true in the SMP case if the other
processor has more than one thread that is runnable. We cannot steal
curthread.
- Change kseq_print() to accept the cpuid instead of a kseq pointer. This
makes use of this function in ddb much easier.


# 113417 12-Apr-2003 jeff

- Unbreak priority prop. for timeshare threads. Always place something on
the current queue if its priority is really elevated. This needs more work
as there are cases where a next queue kse could be holding up what would
be a curr queue kse, and thus hurting interactivity. Also, when a thread
with an elevated priority has its priority lowered it should be placed
back on the next queue.


# 113387 12-Apr-2003 jeff

- Clean up some debug code left over from my earlier megacommit.


# 113386 12-Apr-2003 jeff

- We only care about the base priority. Ignore the SCHED_FIFO_BIT so that
we don't get confused.

Reported and debugged by: Steve Kargl <sgk@troutmask.apl.washington.edu>


# 113372 11-Apr-2003 jeff

- Add sched_exit_*
- Call sched_exit_kse() from sched_exit() instead of implementing it here.


# 113371 11-Apr-2003 jeff

- Only select kseqs with more than one kse to steal. The running kse
is reflected in the load now and you can't very well migrate that.


# 113370 11-Apr-2003 jeff

- When migrating a kse from one kseq to the next actually insert it onto
the second kseq's run queue so that it is referenced by the kse when
it is switched out.
- Spell ksq_rslices properly.

Reported by: Ian Freislich <ianf@za.uu.net>


# 113357 11-Apr-2003 jeff

- Add a SYSCTL node for the ule scheduler.
- Allow user adjustable min and max time slices (suggested by hiten).
- Change the SLP_RUN_MAX to 100ms from 2 seconds so that we learn whether a
process is interactive or not much more quickly.
- Place a process on the current run queue if it is interactive or if it is
running at an interrupt thread priority due to priority prop.
- Use the 'current' timeshare queue for interrupt threads, realtime threads,
and idle threads that are running at higher priority due to priority prop.
This fixes problems where priorities would have been elevated but we would
not check the timeshare run queue until other lower priority tasks were
no longer runnable.
- Keep an array of loads indexed by the priority class as well as a global
load.
- Keep a bucket of nice values with a count of the number of kses currently
runnable with that nice value.
- Keep track of the minimum nice value of any running thread.
- Remove the unused short term sleep accounting. I was attempting to use
this for load balancing but it didn't work out.
- Define a kseq_print() for use with debugging.
- Add KTR debugging at useful places so we can easily debug slice and
priority assignment.
- Decouple the runq assignment from the kseq assignment. kseq_add now keeps
track of statistics. This is done so that the nice and load are still
tracked for the currently running process. Previously, if a niced process
was added while an un-niced process was running, the niced process would
still get a slice, since the code was not aware of the un-niced process.
- Make adjustments for the sched api changes.


# 113339 10-Apr-2003 julian

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but its purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 112994 02-Apr-2003 jeff

- Keep separate statistics and run queues for different scheduling classes.
- Treat each class specially in kseq_{choose,add,rem}. Let the rest of the
code be less aware of scheduling classes.
- Skip the interactivity calculation for non-TIMESHARE ksegrps.
- Move slice and runq selection into kseq_add(). Uninline it now that it's
big.


# 112971 02-Apr-2003 jeff

- Make the interactivity calculator decay faster.
- Make the pcpu estimator update faster.


# 112970 02-Apr-2003 jeff

- I meant divide by two and not shift by two in SCHED_PRI_NHALF.


# 112966 02-Apr-2003 jeff

- Add in support for KSEs with 0 slice values on the run queue. If we try
to select a KSE with a slice of 0 we will update its slice and insert it
onto the next queue.
- Pass the KSE instead of the ksegrp into sched_slice(). This more
accurately reflects the behavior of the code. Slices are granted to kses.
- Add a function kseq_nice_min() which finds the smallest nice value
assigned to the kseg of any KSE on the queue.
- Rewrite the logic in sched_slice(). Add a large comment describing the
new slice selection scheme. To summarize, slices are assigned based on
the nice value. Priorities are still calculated based on the nice and
interactivity of a process. Slice sizes of 0 may be granted for KSEs
whose nice is 20 or further away from the lowest nice on the run queue.
Other nice values are scaled across the range [min, min+20] (sketched
below). This fixes ULE's bad behavior with positively niced processes.
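
A standalone C sketch of that scaling; the slice bounds are placeholder
values, not the scheduler's tunables:

    #include <stdio.h>

    #define SLICE_MAX  100          /* hypothetical max slice, in ms */
    #define SLICE_MIN  10           /* hypothetical min slice, in ms */
    #define NICE_RANGE 20

    /*
     * A KSE's slice shrinks linearly with its nice distance from the
     * lowest nice value on the run queue; 20 or more away earns 0.
     */
    static int
    slice_for(int nice, int min_nice)
    {
            int dist = nice - min_nice;

            if (dist >= NICE_RANGE)
                    return (0);
            if (dist <= 0)
                    return (SLICE_MAX);
            return (SLICE_MAX - (SLICE_MAX - SLICE_MIN) * dist / NICE_RANGE);
    }

    int
    main(void)
    {
            /* With a minimum nice of -5 on the queue: 100, 55, 15, 0. */
            printf("%d %d %d %d\n", slice_for(-5, -5), slice_for(5, -5),
                slice_for(14, -5), slice_for(15, -5));
            return (0);
    }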


# 111857 04-Mar-2003 jeff

- Create a function sched_interact_score() which decides on the
interactivity of a kseg and assigns it a value of 0 through 100 (sketched
below).
- Use sched_interact_score() to determine the dynamic priority.
- Define SCHED_CURR() in terms of sched_interact_score().
- Adjust the maximum slice back down to 100ms.
- Remove redundant clearing of ke_runq in sched_wakeup()
- Clean up #defines and comment them.
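
A plausible standalone reconstruction of such a scorer, not the committed
code; the constants and exact mapping are illustrative. Low scores mean a
mostly-sleeping, interactive kseg; high scores mean a cpu hog:

    #include <stdio.h>

    #define INTERACT_MAX  100
    #define INTERACT_HALF (INTERACT_MAX / 2)

    static int
    interact_score(unsigned long runtime, unsigned long slptime)
    {
            if (runtime > slptime)  /* cpu bound: upper half of range */
                    return (INTERACT_HALF + INTERACT_HALF -
                        slptime * INTERACT_HALF / runtime);
            if (slptime > runtime)  /* mostly sleeping: lower half */
                    return (runtime * INTERACT_HALF / slptime);
            return (INTERACT_HALF); /* perfectly balanced */
    }

    int
    main(void)
    {
            printf("%d\n", interact_score(90, 10));  /* ~95, cpu hog */
            printf("%d\n", interact_score(10, 90));  /* ~5, interactive */
            return (0);
    }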


# 111793 03-Mar-2003 jeff

- Shift the tick count by 10 and back around the sched_pctcpu_update()
calculations. Keep these changes local to the function so the tick count
is in its natural form otherwise. Previously 1000 was added each time
a tick fired and we divided by 1000 when it was reported. This is done
to reduce rounding errors.
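
A standalone C sketch of that fixed-point trick; the window numbers are
made up for the demonstration:

    #include <stdio.h>

    #define FSHIFT 10       /* the 10-bit shift described above */

    int
    main(void)
    {
            unsigned long ticks = 137;      /* ticks used in the window */
            unsigned long window = 250;     /* window length, in ticks */
            unsigned long pct;

            /* Naive division: every fraction below 1.0 rounds to 0. */
            printf("naive: %lu\n", ticks / window);

            /* Shift up before dividing, shift back only on report. */
            pct = (ticks << FSHIFT) / window;
            printf("fixed-point: %lu/1024\n", pct);  /* ~0.548 */
            return (0);
    }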


# 111789 03-Mar-2003 jeff

- In sched_add() special case PRI_TIMESHARE and PRI_ITHD|PRI_REALTIME. We
always place ITHD & REALTIME threads on the current queue of the current
cpu. Prior to this change an interrupt thread would only ever run on one
cpu.


# 111788 03-Mar-2003 jeff

- Refrain from setting the td_priority in sched_wakeup(). It will be reset
before we return to user space.


# 111585 27-Feb-2003 julian

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 111032 17-Feb-2003 julian

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case...
I should have listened to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 110646 10-Feb-2003 jeff

- Enable STRICT_RESCHED until code that dynamically decides on resched
strictness based on the current workload is finished.


# 110645 10-Feb-2003 jeff

- Add a new variable 'kg_runtime' that tracks the amount of time we've run.
- Use the ratio of kg_runtime / kg_slptime to determine our dynamic priority.
- Scale kg_runtime and kg_slptime back when the sum of the two exceeds
SCHED_SLP_RUN_MAX. This allows us to slowly forget old behavior.
- Scale back the runtime and slptime in fork so that the new process has the
same ratio but much less accumulated time. This causes new behavior to be
noticed more quickly.
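
A standalone C sketch of the fork-time scale-back; the divisor of 10 is a
placeholder, as the note above does not specify the factor:

    #include <stdio.h>

    struct kg_hist {
            unsigned long runtime;
            unsigned long slptime;
    };

    /*
     * The child starts with the parent's sleep:run ratio but far less
     * accumulated history, so its own behavior dominates quickly.
     */
    static void
    fork_history(const struct kg_hist *parent, struct kg_hist *child)
    {
            child->runtime = parent->runtime / 10;  /* hypothetical factor */
            child->slptime = parent->slptime / 10;
    }

    int
    main(void)
    {
            struct kg_hist parent = { 1000000, 500000 }, child;

            fork_history(&parent, &child);
            printf("run %lu slp %lu\n", child.runtime, child.slptime);
            return (0);
    }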


# 110267 03-Feb-2003 jeff

- Make some context switches conditional on SCHED_STRICT_RESCHED. This may
have some negative effect on interactivity but it yields great perf. gains.
This also brings the conditions under which ULE context switches in line
with SCHED_4BSD.
- Define some new kseq_* functions for manipulating the run queue.
- Add a new kseq member ksq_rslices and ksq_bload. rslices is the sum of
the slices of runnable kses. This will be used for push load balance
decisions. bload is the number of threads blocked waiting on IO.


# 110260 03-Feb-2003 jeff

- Stop abusing oncpu for our cpu binding. Define a scheduler-local element
in the kse data structure called ke_cpu. This is the cpu which we are
currently bound to. Some flags may be added later to support hard binding.


# 110226 02-Feb-2003 scottl

Use hz if stathz is zero. Adopted from sched_4bsd.
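
A trivial standalone sketch of the fallback, with the kernel's stathz and
hz globals modeled as parameters:

    #include <stdio.h>

    /* stathz may be zero on platforms with no separate stat clock. */
    static int
    effective_stathz(int stathz, int hz)
    {
            return (stathz ? stathz : hz);
    }

    int
    main(void)
    {
            printf("%d\n", effective_stathz(0, 1000));   /* falls back */
            printf("%d\n", effective_stathz(128, 1000)); /* keeps stathz */
            return (0);
    }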


# 110028 29-Jan-2003 jeff

- Use ksq_load as the authoritative count of kses on the pair of kseqs for
sched_runnable() et al.
- Remove some dead code in sched_clock().
- Define two macros KSEQ_SELF() and KSEQ_CPU() for getting the kseq of the
current cpu or some alternate cpu.
- Start introducing kseq_() functions, such as kseq_choose() and kseq_setup().
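
A standalone sketch of accessors in the spirit of KSEQ_SELF()/KSEQ_CPU();
the static array and curcpu variable are stand-ins for the kernel's
per-CPU data:

    #include <stdio.h>

    #define MAXCPU 32

    struct kseq {
            int ksq_load;               /* runnable kses on this cpu */
    };

    static struct kseq kseq_cpu[MAXCPU];
    static int curcpu;                  /* stand-in for the PCPU cpuid */

    #define KSEQ_CPU(x) (&kseq_cpu[(x)])
    #define KSEQ_SELF() KSEQ_CPU(curcpu)

    int
    main(void)
    {
            curcpu = 2;
            KSEQ_SELF()->ksq_load++;
            printf("%d\n", KSEQ_CPU(2)->ksq_load);  /* prints 1 */
            return (0);
    }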


# 110014 28-Jan-2003 jeff

- Remove debugging code that didn't work on UP.


# 109971 28-Jan-2003 jeff

- Allow idle's pctcpu time to be calculated.


# 109970 28-Jan-2003 jeff

- Fix the ksq_load calculation. It now reflects the number of entries on the
run queue for each cpu.
- Introduce kse stealing into the sched_choose() code. This helps balance
cpus better in cases where process turnover is high. This implementation
is fairly trivial and will likely be only a temporary measure until
something more sophisticated has been written.
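
A standalone C sketch of that pull-side stealing; the load array stands
in for per-CPU kseqs, and only queues with more than one runnable entry
are victims, since the running thread itself cannot be stolen:

    #include <stdio.h>

    #define NCPU 4

    /* Returns the victim cpu, or -1 if every queue is too light. */
    static int
    steal_from(int self, int load[NCPU])
    {
            int i;

            for (i = 0; i < NCPU; i++) {
                    if (i == self || load[i] <= 1)
                            continue;
                    load[i]--;          /* take one runnable thread */
                    load[self]++;
                    return (i);
            }
            return (-1);
    }

    int
    main(void)
    {
            int load[NCPU] = { 0, 1, 3, 2 };

            printf("stole from cpu %d\n", steal_from(0, load)); /* cpu 2 */
            return (0);
    }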


# 109864 26-Jan-2003 jeff

- Add the ule scheduler. This is intended to be a general purpose process
scheduler with many SMP benefits. It is still very experimental and should
be used only in test environments.