History log of /freebsd-9.3-release/sys/ia64/ia64/interrupt.c
Revision Date Author Comments
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 224114 16-Jul-2011 marcel

Don't send EOI to the CPU before we have handled the interrupt. Doing so
could potentially trigger multiple pending interrupts for level-sensitive
interrupts. However, the event timer interrupt does need EOI before
being handled to avoid missing clock events.

These conflicting requirements are handled by having the XIV handler
inform the dispatch code whether or not it sent EOI to the CPU. If not,
the dispatch code will do it. This allows handlers to send EOI before
doing potentially long-running activities, while still having a sensible
default behaviour.
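
A minimal sketch of that handshake, using hypothetical names (handler
table, handler type and EOI helper are all illustrative, not the
committed code):

#include <sys/types.h>

struct trapframe;

/* Hypothetical handler type: returns non-zero if it already sent EOI. */
typedef int	xiv_handler_t(struct trapframe *, void *);

struct xiv_entry {
	xiv_handler_t	*xe_func;
	void		*xe_arg;
};

static struct xiv_entry xiv_table[256];		/* one slot per XIV */

/* Stand-in for the cr.eoi write; the kernel has its own accessor. */
static void
cpu_send_eoi(void)
{
}

static void
xiv_dispatch(struct trapframe *tf, u_int xiv)
{
	struct xiv_entry *xe = &xiv_table[xiv];

	/*
	 * Run the handler first.  Only signal EOI from the dispatcher when
	 * the handler reports that it did not do so itself (the event
	 * timer handler EOIs early to avoid losing one-shot clock events).
	 */
	if (xe->xe_func == NULL || xe->xe_func(tf, xe->xe_arg) == 0)
		cpu_send_eoi();
}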


# 223529 25-Jun-2011 marcel

Replace the original copyright notice with my own. Everything in
this file is written by me and has no bearing on the initial or
original version.


# 223526 25-Jun-2011 marcel

Switch to the event timers infrastructure. This includes:
o Setting td_intr_frame to the XIV's trap frame because it's referenced
by the ET event handler.
o Signal EOI to the CPU before calling the registered XIV handlers.
This prevents lost ITC interrupts, which cause starvation in one-shot
mode.
o Adding support for IPI_HARDCLOCK with corresponding per-CPU counters.
o Have the APs call cpu_initclocks() so as to limit the scattering of
clock related initialization. cpu_initclocks() calls the <self>_bsp()
or <self>_ap() version accordingly.
o Uncomment the ET clock handling in cpu_idle().
o Update the DDB 'show pcpu' output for the new MD fields.
o Entirely rewrote ia64_ih_clock(). Note that we don't create as many
clock XIVs as we have CPUs, as is done on PowerPC. It doesn't scale.
We can only have 240 XIVs and we can have more CPUs than that. There's
a single intrcnt index for the cumulative clock ticks and we keep per
CPU counts in the PCPU stats structure.
o Register the ITC by hooking SI_SUB_CONFIGURE (2nd order).

Open issues:
o Clock interrupts can still be lost. Some tweaking is still necessary.

Thanks to: mav@ for his support, feedback and explanations.

ET stats while committing:
eris% sysctl machdep.cpu | grep nclks

machdep.cpu.0.nclks: 24007
machdep.cpu.1.nclks: 22895
machdep.cpu.2.nclks: 13523
machdep.cpu.3.nclks: 9342
machdep.cpu.4.nclks: 9103
machdep.cpu.5.nclks: 9298
machdep.cpu.6.nclks: 10039
machdep.cpu.7.nclks: 9479
eris% vmstat -i | grep clock
clock 108599 50


# 205726 27-Mar-2010 marcel

Implement interrupt to CPU binding. Assign interrupts to CPUs in a
round-robin fashion, starting with the highest priority interrupt
on the highest-numbered CPU and cycling downwards.
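
A minimal sketch of that round-robin policy (illustrative only; the
committed code keeps its own bookkeeping): interrupts are visited in
priority order, highest first, and each is bound to the CPU returned by
a helper like the one below.

#include <sys/types.h>

/*
 * Pick the next CPU for an interrupt, counting down from the
 * highest-numbered CPU and wrapping around.
 */
static u_int
intr_next_cpu(u_int ncpus)
{
	static u_int cpu = 0;

	cpu = (cpu == 0) ? ncpus - 1 : cpu - 1;
	return (cpu);
}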


# 205713 26-Mar-2010 marcel

Rename disable_intr() to ia64_disable_intr() and rename enable_intr()
to ia64_enable_intr(). This reduces confusion with intr_disable() and
intr_restore().

Have configure_final() call ia64_finalize_intr() instead of enable_intr()
in preparation of adding support for binding interrupts to all CPUs.


# 205433 22-Mar-2010 marcel

Fix interrupt handling by extending the critical region so that
preemption doesn't happen until after all pending interrupts have
been serviced.
While here, again simplify the EOI handling by doing it after we
call the XIV-specific handlers, rather than in each of them. The
original thought was that we may want to do an EOI first and the
actual IPI handling next, but that's mostly a micro-optimization.


# 205234 16-Mar-2010 marcel

Revamp the interrupt code based on the previous commit:
o Introduce XIV, eXternal Interrupt Vector, to differentiate from
the interrupt vectors that are offsets in the IVT (Interrupt
Vector Table). There's a vector for external interrupts, which
are based on the XIVs.

o Keep track of allocated and reserved XIVs so that we can assign
XIVs without hardcoding anything. When XIVs are allocated, an
interrupt handler and a class is specified for the XIV. Classes
are:
1. architecture-defined: XIV 15 is returned when no external
interrupts are pending,
2. platform-defined: SAL reports which XIV is used to wake up
an AP (typically 0xFF, but it's 0x12 for the Altix 350).
3. inter-processor interrupts: allocated for SMP support and
non-redirectable.
4. device interrupts (i.e. IRQs): allocated when devices are
discovered and are redirectable. (A sketch of this XIV
bookkeeping follows at the end of this entry.)

o Rewrite the central interrupt handler to call the per-XIV
interrupt handler and rename it to ia64_handle_intr(). Move
the per-XIV handler implementation to the file where we have
the XIV allocation/reservation. Clock interrupt handling is
moved to clock.c. IPI handling is moved to mp_machdep.c.

o Drop support for the Intel 8259A because it was broken. When
XIV 0 is received, the CPU should initiate an INTA cycle to
obtain the interrupt vector of the 8259-based interrupt. In
these cases the interrupt controller we should be talking to
WRT masking and signalling EOI is the 8259 and not the I/O
SAPIC. This requires a driver for the Intel 8259A, which isn't
available for ia64. Thus stop pretending to support ExtINTs
and instead panic(), so that if we come across hardware that
has an Intel 8259A we have something real to work with.

o With XIVs for IPIs dynamically allocated and also based on
priority, define the IPI_* symbols as variables rather than
constants. The variable holds the XIV allocated for the IPI.

o IPI_STOP_HARD delivers an NMI if possible. Otherwise the XIV
assigned to IPI_STOP is delivered.
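
A sketch of what the XIV bookkeeping mentioned above can look like; the
type, field and constant names here are illustrative, not the committed
structures:

#include <sys/types.h>

struct trapframe;
typedef void	xiv_func_t(struct trapframe *, void *);

/* The four classes named above. */
enum xiv_class {
	XIV_ARCH,	/* architecture-defined, e.g. XIV 15 (spurious) */
	XIV_PLAT,	/* platform-defined, e.g. the AP wake-up XIV */
	XIV_IPI,	/* inter-processor interrupts, not redirectable */
	XIV_IRQ		/* device interrupts, redirectable */
};

struct xiv_desc {
	enum xiv_class	 xd_class;
	xiv_func_t	*xd_func;
	int		 xd_used;
};

static struct xiv_desc xiv_descs[256];

/*
 * Allocate a free XIV for a handler, scanning from the highest vector
 * (highest priority) downwards; XIVs 0-15 are left to the architecture.
 */
static int
xiv_alloc(enum xiv_class cls, xiv_func_t *func)
{
	int xiv;

	for (xiv = 255; xiv >= 16; xiv--) {
		if (!xiv_descs[xiv].xd_used) {
			xiv_descs[xiv].xd_used = 1;
			xiv_descs[xiv].xd_class = cls;
			xiv_descs[xiv].xd_func = func;
			return (xiv);
		}
	}
	return (-1);	/* all 240 assignable XIVs are in use */
}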


# 204425 27-Feb-2010 marcel

Interrupt related cleanups:
o Assign vectors based on priority, because vectors have
implied priority in hardware.
o Use unordered memory accesses to the I/O sapic and use
the acceptance form of the mf instruction.
o Remove the sapicreg.h and sapicvar.h headers. All definitions
in sapicreg.h are private to sapic.c and all definitions in
sapicvar.h are either private or interface functions. Move the
interface functions to intr.h.
o Hide the definition of struct sapic.


# 203883 14-Feb-2010 marcel

Some code churn:
o Eliminate IA64_PHYS_TO_RR6 and change all places where the macro is used
to call either bus_space_map() or pmap_mapdev().
o Implement bus_space_map() in terms of pmap_mapdev() and implement
bus_space_unmap() in terms of pmap_unmapdev().
o Have ia64_pib hold the uncached virtual address of the processor interrupt
block throughout the kernel's life and access the elements of the PIB
through this structure pointer.

This is a non-functional change with the exception of using ia64_ld1() and
ia64_st8() to write to the PIB. We were still using assignments, for which
the compiler generates semaphore reads -- which cause undefined behaviour
for uncacheable memory. Note also that the memory barriers in ipi_send() are
critical for proper functioning.

With all the mapping of uncached memory done by pmap_mapdev(), we can keep
track of the translations and wire them in the CPU. This then eliminates
the need to reserve a whole region for uncached I/O and it eliminates
translation traps for device I/O accesses.
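
A minimal sketch of the PIB store pattern this refers to, assuming the
ia64_st8()/ia64_mf() accessors live in <machine/ia64_cpu.h>; this is
illustrative, not the committed ipi_send():

#include <sys/types.h>
#include <machine/ia64_cpu.h>	/* assumed home of ia64_st8() and ia64_mf() */

/*
 * Post an interrupt by writing the XIV into a processor's PIB slot.
 * The explicit 8-byte store avoids the compiler-generated semaphore
 * reads mentioned above, and the fences order the uncacheable write.
 */
static void
pib_post_ipi(void *pib_slot, uint64_t xiv)
{
	ia64_mf();			/* order earlier stores before the post */
	ia64_st8(pib_slot, xiv);	/* explicit store to uncacheable memory */
	ia64_mf();			/* make the post visible right away */
}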


# 200207 07-Dec-2009 marcel

Define struct pcpu_md as the only MD field of struct pcpu (pc_acpi_id
excluded, as it's used by MI code) and move the sysctl variables from
pcpu_stats to pcpu_md.
Adjust all references accordingly.

While nearby, change the PCPU sysctl tree so that it matches the CPU
device sysctl tree -- the per-CPU nodes are now children of a static
node called "machdep.cpu" and are named only with their CPU ID.


# 199893 28-Nov-2009 marcel

Eliminate the use of MAXCPU in static arrays of interrupt counters by
adding statistics counters to the PCPU structure. Export the counters
through sysctl by giving each PCPU structure its own sysctl context.

While here, fix cnt.v_intr by having it count every interrupt rather than
just clock interrupts, and add more counters for each interrupt source.


# 199727 23-Nov-2009 marcel

Improve upon revision 196196 by removing the newly added comment
in the wrong place and instead add a KASSERT in the right place.


# 198733 31-Oct-2009 marcel

Reimplement the lazy FP context switching:
o Move all code into a single file for easier maintenance.
o Use a single global lock to avoid having to handle either
multiple locks or race conditions.
o Make sure to disable the high FP registers after saving
or dropping them.
o Use msleep() to wait for the other CPU to save the high
FP registers.

This change fixes the high FP inconsistency panics.

A single global lock typically serializes too much, which may
be noticeable when a lot of threads use the high FP registers,
but in that case it's probably better to switch the high FP
context synchronously. Put differently: cpu_switch() should
switch the high FP registers if the incoming and outgoing
threads both use the high FP registers.


# 196196 13-Aug-2009 attilio

* Completely remove the option STOP_NMI from the kernel. This option
has proven to have a good effect when entering KDB via an NMI,
but it completely violates the rules about keeping interrupts
disabled while holding a spinlock in other situations. This can be the
cause of deadlocks on events where a normal IPI_STOP is expected.
* Add a new IPI called IPI_STOP_HARD on all the supported architectures.
This IPI is responsible for sending a stop message among CPUs using a
privileged channel when available. Otherwise it just matches a
normal IPI_STOP.
Right now the IPI_STOP_HARD functionality uses an NMI on the ia32 and amd64
architectures, while on the others it has a normal IPI_STOP effect. It is
the responsibility of maintainers to eventually implement a hard stop
where necessary and possible.
* Use the new IPI facility to implement a new SMP kernel function
called stop_cpus_hard(). It mirrors stop_cpus() but uses the
privileged channel for the stopping facility.
* Let KDB use the newly introduced function stop_cpus_hard() and leave
stop_cpus() for all the other cases.
* Disable interrupts on CPU0 when starting the process of suspending the APs.
* Style cleanup and comment additions.

This patch should fix the reboot/shutdown deadlocks many users are
constantly reporting on mailing lists.

Please don't forget to remove the STOP_NMI option from your kernel
config file.

Reviewed by: jhb
Tested by: pho, bz, rink
Approved by: re (kib)


# 183439 28-Sep-2008 marius

Remove ipi_all() and ipi_self(), as the former hasn't been used at
all to date and the latter is only used in ia64 and powerpc code
which no longer serves a real purpose after bring-up and can just
be removed as well. Note that architectures like sun4u also
provide no means of natively IPI'ing a CPU itself in the
first place.

Suggested by: jhb
Reviewed by: arch, grehan, jhb


# 179256 23-May-2008 marcel

Account for IPI_PREEMPT. We don't want to call sched_preempt() with
interrupts disabled or with td_intr_nesting_level non-zero.


# 178215 15-Apr-2008 marcel

Support and switch to the ULE scheduler:
o Implement IPI_PREEMPT,
o Set td_lock for the thread being switched out,
o For ULE & SMP, loop while td_lock points to blocked_lock for
the thread being switched in,
o Enable ULE by default in GENERIC and SKI.


# 178131 11-Apr-2008 jeff

- Pass the irq and not the vector to intr_event_create().

Reviewed by: marcel


# 178092 11-Apr-2008 jeff

- Add the interrupt vector number to intr_event_create so MI code can
lookup hard interrupt events by number. Ignore the irq# for soft intrs.
- Add support to cpuset for binding hardware interrupts. This has the
side effect of binding any ithread associated with the hard interrupt.
As per restrictions imposed by MD code we can only bind interrupts to
a single cpu presently. Interrupts can be 'unbound' by binding them
to all cpus.

Reviewed by: jhb
Sponsored by: Nokia


# 177940 05-Apr-2008 jhb

Add a MI intr_event_handle() routine for the non-INTR_FILTER case. This
allows all the INTR_FILTER #ifdef's to be removed from the MD interrupt
code.
- Rename the intr_event 'eoi', 'disable', and 'enable' hooks to
'post_filter', 'pre_ithread', and 'post_ithread' to be less x86-centric.
Also, add a comment describing what the MI code expects them to do.
- On amd64, i386, and powerpc this is effectively a NOP.
- On arm, don't bother masking the interrupt unless the ithread is
scheduled in the non-INTR_FILTER case to match what INTR_FILTER did.
Also, don't bother unmasking the interrupt in the post_filter case if
we never masked it. The INTR_FILTER case had been doing this by having
arm_unmask_irq for the post_filter (formerly 'eoi') hook.
- On ia64, stray interrupts are now masked for the non-INTR_FILTER case.
They were already masked in the INTR_FILTER case.
- On sparc64, use a NULL pre_ithread hook and use intr_enable_eoi() for
both the 'post_filter' and 'post_ithread' hooks to match what the
non-INTR_FILTER code did.
- On sun4v, retire the ithread wrapper hack by using an appropriate
'post_ithread' hook instead (it's what 'post_ithread'/'enable' was
designed to do even in 5.x).

Glanced at by: piso
Reviewed by: marius
Requested by: marius [1], [5]
Tested on: amd64, i386, arm, sparc64


# 177325 17-Mar-2008 jhb

Simplify the interrupt code a bit:
- Always include the ie_disable and ie_eoi methods in 'struct intr_event'
and collapse down to one intr_event_create() routine. The disable and
eoi hooks simply aren't used currently in the !INTR_FILTER case.
- Expand 'disab' to 'disable' in a few places.
- Use function casts for arm and i386:intr_eoi_src() instead of wrapper
routines to trim one extra indirection.

Compiled on: {arm,amd64,i386,ia64,ppc,sparc64} x {FILTER, !FILTER}
Tested on: {amd64,i386} x {FILTER, !FILTER}


# 177181 14-Mar-2008 jhb

Add preliminary support for binding interrupts to CPUs:
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
code wishes to bind an interrupt source to an individual CPU. The MD
code may reject the binding with an error. If an assign_cpu function
is not provided, then the kernel assumes the platform does not support
binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
event is bound to a CPU. Only shared ithreads are bound. We currently
leave private ithreads for drivers using filters + ithreads in the
INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
PIC method.
- For x86, provide a 'intr_bind(IRQ, cpu)' wrapper routine that looks up
an interrupt source and binds its interrupt event to the specified CPU.
MI code can currently (ab)use this by doing:

intr_bind(rman_get_start(irq_res), cpu);

however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
where the implementation in the x86 nexus(4) driver would end up calling
intr_bind() internally.

Requested by: kmacy, gallatin, jeff
Tested on: {amd64, i386} x {regular, INTR_FILTER}


# 173799 21-Nov-2007 scottl

Extend critical section coverage in the low-level interrupt handlers to
include the ithread scheduling step. Without this, a preemption might
occur in between the interrupt getting masked and the ithread getting
scheduled. Since the interrupt handler runs in the context of curthread,
the scheduler might see it as having such a low priority on a busy system
that it doesn't get to run for a _long_ time, leaving the interrupt stranded
in a disabled state. The only way that the preemption can happen is by
a fast/filter handler triggering a scheduling event earlier in the handler,
so this problem can only happen for cases where an interrupt is being
shared by both a fast/filter handler and an ithread handler. Unfortunately,
it seems to be common for this sharing to happen with network and USB
devices, for example. This fixes many of the mysterious TCP session
timeouts and NIC watchdogs that were being reported. Many thanks to Sam
Leffler for getting to the bottom of this problem.

Reviewed by: jhb, jeff, silby


# 171739 06-Aug-2007 marcel

Keep interrupts disabled while handling external interrupts.
There's no advantage in allowing nested external interrupts.
In fact, it leads to a potential stack overrun.

While here, put the interrupt vector in the trapframe, so as
to compensate for the 36 cycle latency of reading cr.ivr.

Further simplify assembly code by dealing with ASTs from C.

Approved by: re (blanket)


# 171664 30-Jul-2007 marcel

Rework the interrupt code and add support for interrupt filtering
(INTR_FILTER). This includes:
o Save a pointer to the sapic structure and IRQ for every vector,
so that we can quickly EOI, mask and unmask the interrupt.
o Add locking to the sapic code now that we can reprogram a
sapic on multiple CPUs at the same time.
o Use u_int for the vector and IRQ. We only have 256 vectors, so
using a 64-bit type for it is rather excessive.
o Properly handle concurrent registration of a handler for the
same vector.

Since vectors have a corresponding priority, we should not map
IRQs to vectors in a linear fashion, but rather pick a vector
that has a priority in line with the interrupt type. This is left
for later. The vector/IRQ interchange has been untangled as much
as possible to make this easier.

Approved by: re (blanket)


# 170291 04-Jun-2007 attilio

Rework the PCPU_* (MD) interface:
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
given a specific value.

Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 170162 31-May-2007 piso

In some particular cases (like in pccard and pccbb), the real device
handler is wrapped in a couple of functions - a filter wrapper and an
ithread wrapper. In this case (and just in this case), the filter
wrapper could ask the system to schedule the ithread and mask the
interrupt source if the wrapped handler is composed of just an ithread
handler: modify the "old" interrupt code to make it support
this situation, while the "new" interrupt code is already ok.

Discussed with: jhb


# 166901 23-Feb-2007 piso

o break newbus api: add a new argument of type driver_filter_t to
bus_setup_intr()

o add an int return code to all fast handlers

o retire INTR_FAST/IH_FAST

For more info: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=465712+0+current/freebsd-current

Reviewed by: many
Approved by: re@


# 164392 18-Nov-2006 marcel

Now that printf() needs the PCPU, set it up before we call printf().
Change the pc_pcb field from a pointer to struct pcb to struct pcb
so that sizeof(struct pcb) includes the PCB we use for IPI_STOP.
Statically declare early_pcb so that we don't have to allocate the
PCB for thread0. This way we can setup the PCPU before cninit()
and thus before we use printf().


# 157449 03-Apr-2006 marcel

Improve handling of IPI_STOP:
o use atomic operations to fiddle with stopped_cpus and started_cpus.
o disable interrupts while we're waiting to be started.
o remove logic relating to cpustop_restartfunc as it's not used.
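
A minimal sketch of that pattern, with stand-in variables (the kernel's
own masks and MD helpers differ):

#include <sys/types.h>
#include <machine/atomic.h>

/* Stand-ins for the kernel's CPU masks. */
static volatile u_int stopped_cpus, started_cpus;

/*
 * Park the current CPU after an IPI_STOP.  Interrupts are assumed to be
 * disabled for the whole wait, per the second item above.
 */
static void
cpu_stop_self(u_int cpuid)
{
	u_int mask = 1u << cpuid;

	atomic_set_int(&stopped_cpus, mask);
	while ((started_cpus & mask) == 0)
		;	/* spin until another CPU restarts us */
	atomic_clear_int(&started_cpus, mask);
	atomic_clear_int(&stopped_cpus, mask);
}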


# 153666 22-Dec-2005 jhb

Tweak how the MD code calls the fooclock() methods some. Instead of
passing a pointer to an opaque clockframe structure and requiring the
MD code to supply CLKF_FOO() macros to extract needed values out of the
opaque structure, just pass the needed values directly. In practice this
means passing the pair (usermode, pc) to hardclock() and profclock() and
passing the boolean (usermode) to hardclock_cpu() and hardclock_process().
Other details:
- Axe clockframe and CLKF_FOO() macros on all architectures. Basically,
all the archs were taking a trapframe and converting it into a clockframe
one way or another. Now they can just extract the PC and usermode values
directly out of the trapframe and pass it to fooclock().
- Renamed hardclock_process() to hardclock_cpu() as the latter is more
accurate.
- On Alpha, we now run profclock() at hz (profhz == hz) rather than at
the slower stathz.
- On Alpha, for the TurboLaser machines that don't have an 8254
timecounter, call hardclock() directly. This removes an extra
conditional check from every clock interrupt on Alpha on the BSP.
There is probably room for even further pruning here by changing Alpha
to use the simplified timecounter we use on x86 with the lapic timer
since we don't get interrupts from the 8254 on Alpha anyway.
- On x86, clkintr() shouldn't ever be called now unless using_lapic_timer
is false, so add a KASSERT() to that effect and remove a condition
to slightly optimize the non-lapic case.
- Change the prototype of arm_handler_execute() so that its first arg is a
trapframe pointer rather than a void pointer for clarity.
- Use KCOUNT macro in profclock() to lookup the kernel profiling bucket.

Tested on: alpha, amd64, arm, i386, ia64, sparc64
Reviewed by: bde (mostly)


# 151885 30-Oct-2005 marcel

Remove a stray return statement in the interrupt dispatch function
that caused a premature exit after calling a fast interrupt handler
and bypassing a much needed critical_exit() and the scheduling of
the interrupt thread for non-fast handlers. In short: unbreak :-)


# 151658 25-Oct-2005 jhb

Reorganize the interrupt handling code a bit to make a few things cleaner
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces. struct intr_event holds the list
of interrupt handlers associated with interrupt sources.
struct intr_thread contains the data relative to an interrupt thread.
Currently we still provide a 1:1 relationship of events to threads
with the exception that events only have an associated thread if there
is at least one threaded interrupt handler attached to the event. This
means that on x86 we no longer have 4 bazillion interrupt threads with
no handlers. It also means that interrupt events with only INTR_FAST
handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
intr_foo naming convention. This did require renaming the powerpc
MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
powerpc. This means that multiple INTR_FAST handlers can attach to the
same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
to the same interrupt. Sharing INTR_FAST handlers may not always be
desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
either. Drivers can always still use INTR_EXCL to ask for an interrupt
exclusively. The way this sharing works is that when an interrupt
comes in, all the INTR_FAST handlers are executed first, and if any
threaded handlers exist, the interrupt thread is scheduled afterwards.
This type of layout also makes it possible to investigate using interrupt
filters ala OS X where the filter determines whether or not its companion
threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
dumping their state. It also has a '/v' verbose switch which dumps
info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
handler is removed because it would suck for things like ppbus(8)'s
braindead behavior. The code is present, though, it is just under
#if 0 for now.
- Move the code to actually execute the threaded handlers for an interrupt
event into a separate function so that ithread_loop() becomes more
readable. Previously this code was all in the middle of ithread_loop()
and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
curthread's proc against idlethread's proc. (Not really related to intr
changes)

Tested on: alpha, amd64, i386, sparc64
Tested on: arm, ia64 (older version of patch by cognet and marcel)


# 149915 09-Sep-2005 marcel

Change the High FP lock from a sleep lock to a spin lock. We can
take the lock from interrupt context, which causes an implicit
lock order reversal. We've been using the lock carefully enough
that making it a spin lock should not be harmful.


# 148807 06-Aug-2005 marcel

Improve SMP support:
o Allocate a VHPT per CPU. The VHPT is a hash table that the CPU
uses to look up translations it can't find in the TLB. As such,
the VHPT serves as a level 1 cache (the TLB being a level 0 cache)
and best results are obtained when it's not shared between CPUs.
The collision chain (i.e. the hash bucket) is shared between CPUs,
as all buckets together constitute our collection of PTEs. To
achieve this, the collision chain does not point to the first PTE
in the list anymore, but to a hash bucket head structure. The
head structure contains the pointer to the first PTE in the list,
as well as a mutex to lock the bucket. Thus, each bucket is locked
independently of each other. With at least 1024 buckets in the VHPT,
this provides sufficiently fine-grained locking to make the
solution scalable to large SMP machines. (A sketch of the bucket
head structure follows at the end of this entry.)
o Add synchronisation to the lazy FP context switching. We do this
with a separate per-thread lock. On SMP machines the lazy high FP
context switching without synchronisation caused inconsistent
state, which resulted in a panic. Since the use of the high FP
registers is not common, it's possible that races exist. The ia64
package build has proven to be a good stress test, so this will
get plenty of exercise in the near future.
o Don't use the local ID of the processor we want to send the IPI to
as the argument to ipi_send(). Use the struct pcpu pointer instead.
The reason for this is that IPI delivery is unreliable. It has been
observed that sending an IPI to a CPU causes it to receive a stray
external interrupt. As such, we need a way to make the delivery
reliable. The intended solution is to queue requests in the target
CPU's per-CPU structure and use a single IPI to inform the CPU that
there's a new entry in the queue. If that IPI gets lost, the CPU
can check its queue at any convenient time (such as for each
clock interrupt). This also allows us to send requests to a CPU
without interrupting it, if such would be beneficial.

With these changes SMP is almost working. There are still some random
process crashes and the machine can hang due to having the IPI lost
that deals with the high FP context switch.

The overhead of introducing the hash bucket head structure results
in a performance degradation of about 1% for UP (extra pointer
indirection). This is surprisingly small and is offset by gaining
reasonably good, scalable SMP support.
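
A sketch of the bucket head structure described in the first item, with
illustrative field names (not the committed layout):

#include <sys/lock.h>
#include <sys/mutex.h>

struct ia64_lpte;		/* long-format PTE, defined by the pmap code */

/*
 * Per-bucket head: the collision chain hangs off vb_first and is
 * protected by vb_mtx, so each bucket locks independently of the others.
 */
struct vhpt_bucket {
	struct ia64_lpte	*vb_first;
	struct mtx		 vb_mtx;
};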


# 144971 12-Apr-2005 jhb

Use PCPU_LAZY_INC() for cnt.v_{intr,trap,syscalls} rather than atomic
operations in some places and simple non-per CPU math in others.


# 144962 12-Apr-2005 marcel

Dot the i's:
1 Move the debug.clock_adjust_* sysctls to debug.clock.adjust_* to
make it easier to get only the clock statistics.
2 Make the sysctls read-only [suggested by Marius].
3 When determining the new clock adjustment, we checked for an error
either larger than 12.5% or smaller than 12.5%. We left out an error
of exactly 12.5%. For errors larger than 12.5% we adjust the clock
reload value in such a way that the next clock interrupt would be
early (as in premature). For errors less than 12.5% we stopped the
adjustment.
The current algorithm doesn't benefit from excluding an error of
exactly 12.5%. Change the code to stop adjusting the clock if the
error is *not* larger than 12.5% [suggested by Marius].

Discussed with: marius@


# 139790 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 131481 02-Jul-2004 jhb

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)


# 129022 07-May-2004 marcel

Make sure to sanitize the FP status register. Specifically this
masks all FP traps, which should not happen in the kernel.


# 124737 20-Jan-2004 marcel

s/framep/tf/g -- this normalizes on the use of tf to point to the
trapframe and improves grep-ability.


# 122841 17-Nov-2003 peter

Widen the enable/disable helper function's argument in line with the
ithread_create() changes etc. This should be mostly a NOP.


# 122518 11-Nov-2003 marcel

Further work-out the handling of the high FP registers. The most
important change is in cpu_switch() where we disable the high FP
registers for the thread that we switch-out if the CPU currently
has its high FP registers. This avoids that the high FP registers
remain enabled for the thread even when the CPU has unloaded them
or the thread migrated to another processor.
Likewise, when we switch-in a thread that has its high FP
registers on the CPU, we enable them. This avoids an otherwise
harmless, but unnecessary trap to have them enabled.

The code that handles the disabled high FP trap (in trap()) has
been turned into a critical section for the most part to avoid
being preempted. If there's a race, we bail out and have the
processor trap again if necessary.

Avoid using the generic ia64_highfp_save() function when the
context is predictable. The function adds unnecessary overhead.
Don't use ia64_highfp_load() for the same reason. The function
is now unused and can be removed.

These changes make the lazy context switching of the high FP
registers in an UP kernel functional.


# 119970 10-Sep-2003 marcel

Rewrite the SAPIC initialization to always program the RTEs with what
we think is the correct trigger mode and polarity. This allows us to
implement BUS_CONFIG_INTR() as an update of the RTE in question.
Consequently, we can trust the RTE when we enable an interrupt and
avoid needing to know about the trigger mode and polarity at
that time.


# 119787 05-Sep-2003 marcel

Fix a place where I forgot to change the code that checks whether
we return to kernel or userland. This triggered a panic in a KSE
application when TDF_USTATCLOCK was set in the case userland was
interrupted, but we never called ast() on our way out. As such,
we called ast() at some other time. Unfortunately, TDF_USTATCLOCK
handling assumes running in the interrupt thread. This was not
the case anymore.

To avoid making the same mistake later, interrupt() now returns
to its caller whether we interrupted userland or not. This avoids
having to duplicate the check in assembly, where it's bound
to fall off the scope. Now we simply check the return value and
call ast() if appropriate.

Run into this: davidxu


# 118990 16-Aug-2003 marcel

Further cleanup <machine/cpu.h> and <machine/md_var.h>: move the MI
prototypes of cpu_halt(), cpu_reset() and swi_vm() from md_var.h to
cpu.h. This affects db_command.c and kern_shutdown.c.

ia64: move all MD prototypes from cpu.h to md_var.h. This affects
madt.c, interrupt.c and mp_machdep.c. Remove is_physical_memory().
It's not used (vm_machdep.c).

alpha: the MD prototypes have been left in cpu.h with a comment
that they should be there. Moving them is left for later. It was
expected that the impact would be significant enough to be done in
a separate commit.

powerpc: MD prototypes left in cpu.h. Comment added.

Suggested by: bde
Tested with: make universe (pc98 incomplete)


# 118414 04-Aug-2003 marcel

Cleanup the clock code. This includes:
o Remove alpha specific timer code (mc146818A) and compiled-out
calibration of said timer.
o Remove i386 inherited timer code (i8253) and related acquire and
release functions.
o Move sysbeep() from clock.c to machdep.c and have it return
ENODEV. Console beeps should be implemented using ACPI or if no
such device is described, using the sound driver.
o Move the sysctls related to adjkerntz, disable_rtc_set and
wall_cmos_clock from machdep.c to clock.c, where the variables
are.
o Don't hardcode a hz value of 1024 in cpu_initclocks() and don't
bother faking a stathz that's 1/8 of that. Keep it simple: hz
defaults to HZ and stathz equals hz. This is also how it's done
for sparc64.
o Keep a per-CPU ITC counter (pc_clock) and adjustment (pc_clockadj)
to calculate ITC skew and corrections. On average, we adjust the
ITC match register once every ~1500 interrupts for a duration of
2 consecutive interrupts. This is to correct the non-deterministic
behaviour of the ITC interrupt (there's a delay between the match
and the raising of the interrupt).
o Add 4 debugging sysctls to monitor clock behaviour. Those are
debug.clock_adjust_edges, debug.clock_adjust_excess,
debug.clock_adjust_lost and debug.clock_adjust_ticks. The first
counts the individual adjustment cycles (when the skew first
crosses the threshold), the second counts the number of times the
adjustment was excessive (any non-zero value is to be considered
a bug), the third counts lost clock interrupts and the last counts
the number of interrupts for which we applied an adjustment
(debug.clock_adjust_ticks / debug.clock_adjust_edges gives the
average duration of an individual adjustment -- should be ~2).

While here, remove some nearby (trivial) left-overs from alpha and
other cleanups.


# 115084 16-May-2003 marcel

Revamp of the syscall path, exception and context handling. The
prime objectives are:
o Implement a syscall path based on the epc instruction (see
sys/ia64/ia64/syscall.s).
o Revisit the places where we need to save and restore registers
and define those contexts in terms of the register sets (see
sys/ia64/include/_regset.h).

Secondary objectives:
o Remove the requirement to use contigmalloc for kernel stacks.
o Better handling of the high FP registers for SMP systems.
o Switch to the new cpu_switch() and cpu_throw() semantics.
o Add a good unwinder to reconstruct contexts for the rare
cases we need to (see sys/contrib/ia64/libuwx)

Many files are affected by this change. Functionally it boils
down to:
o The EPC syscall doesn't preserve registers it does not need
to preserve and places the arguments differently on the stack.
This affects libc and truss.
o The address of the kernel page directory (kptdir) had to
be unstaticized for use by the nested TLB fault handler.
The name has been changed to ia64_kptdir to avoid conflicts.
The renaming affects libkvm.
o The trapframe only contains the special registers and the
scratch registers. For syscalls using the EPC syscall path
no scratch registers are saved. This affects all places where
the trapframe is accessed. Most notably the unaligned access
handler, the signal delivery code and the debugger.
o Context switching only partly saves the special registers
and the preserved registers. This affects cpu_switch() and
triggered the move to the new semantics, which additionally
affects cpu_throw().
o The high FP registers are either in the PCB or on some
CPU. Context switching for them is done lazily. This affects
trap().
o The mcontext has room for all registers, but not all of them
have to be defined in all cases. This mostly affects signal
delivery code now. The *context syscalls are as of yet still
unimplemented.

Many details went into the removal of the requirement to use
contigmalloc for kernel stacks. The details are mostly CPU
specific and limited to exception_save() and exception_restore().
The few places where we create, destroy or switch stacks were
mostly simplified by not having to construct physical addresses
and additionally saving the virtual addresses for later use.

Besides more efficient context saving and restoring, which of
course yields a noticeable speedup, this also fixes the dreaded
SMP bootup problem as a side-effect. The details of which are
still not fully understood.

This change includes all the necessary backward compatibility
code to have it handle older userland binaries that use the
break instruction for syscalls. Support for break-based syscalls
has been pessimized in favor of a clean implementation. Due to
the overall better performance of the kernel, this will still
be noticed as an improvement, if it's noticed at all.

Approved by: re@ (jhb)


# 110296 03-Feb-2003 jake

Split statclock into statclock and profclock, and made the method for driving
statclock based on profhz when profiling is enabled MD, since most platforms
don't use this anyway. This removes the need for statclock_process, whose
only purpose was to subdivide profhz, and gets the profiling clock running
outside of sched_lock on platforms that implement suswintr.
Also changed the interface for starting and stopping the profiling clock to
do just that, instead of changing the rate of statclock, since they can now
be separate.

Reviewed by: jhb, tmm
Tested on: i386, sparc64


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109911 26-Jan-2003 julian

Unbreak SMP cases for these architectures.
statclock_process() changed arguments.
note: it may be worth checking if curkse is needed on these architectures..
(and if so, why?)


# 108759 06-Jan-2003 marcel

Move ia64_sapics and ia64_sapic_count from interrupt.c to sapic.c
and declare them extern in interrupt.c. This eliminates the need
for ia64_add_sapic(), which is called from sapic.c.
While here, reformat ia64_enable() in interrupt.c to improve
indentation and add a sysctl (machdep.apic) to dump the I/O APIC
entries currently programmed into all I/O APICs. The latter can
help analyze interrupt problems.
Note that the sysctl is not intended as a userland (software)
interface. It may be changed in the future to include counters
so that vmstat -i can make use of it. It may also be removed...


# 108757 05-Jan-2003 peter

Move the itm reload to a single place rather than having two identical
copies of the reload. Note that we use the precomputed itm_reload value
so that we can avoid a division in the kernel. The ia64 cpu does not
have integer divide, so this would have been done by a floating point
operation.


# 108756 05-Jan-2003 marcel

Replace the hardcoding of 255 as the clock interrupt vector with
CLOCK_VECTOR and define it as 254, not 255. Vector 255 is already
in use as the AP wakeup vector on the HP rx2600.

This needs to be made more dynamic. The likelihood of vector 254
being in use is pretty small, but we already have code to assign
vectors to IPIs (see sal.c) and it's probably better to have a
centralized "vector manager" that hands out vectors based on
some input (like priority).


# 108751 05-Jan-2003 marcel

Manually inline handleclock(). There's only a single caller and
handleclock itself is trivial.

While here, replace (itc_frequency+hz/2)/hz with itm_reload for
consistency. There's now a single place where we determine the
ITM reload value.
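
A sketch of that single place, using the expression quoted above (the
function name is illustrative):

#include <sys/types.h>

static uint64_t itm_reload;	/* ITC ticks per clock interrupt */

/*
 * Computed once at clock setup; the interrupt path then just adds
 * itm_reload to cr.itm, so it never divides (the ia64 has no integer
 * divide instruction).
 */
static void
clock_configure(uint64_t itc_frequency, int hz)
{
	itm_reload = (itc_frequency + hz / 2) / hz;	/* rounded to nearest */
}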


# 108749 05-Jan-2003 marcel

Count interrupts as soon as possible. This makes sure interrupts are
counted even when there are no handlers.


# 106066 27-Oct-2002 marcel

Make vmstat -i work:
o Properly set the pointer to the counter for each interrupt and
update the intrnames table.
o Remove Alpha cruft from intrcnt.h.
o Create INTRNAME_LEN as the single entity that defines the width
of the names in the intrnames table (incl. terminating '\0').


# 104433 03-Oct-2002 peter

Do a bit of rude hackery to get clock interrupts on all CPUs. This
is partly based on the Alpha system which duplicates the clock to
each cpu, instead of doing a clock roundrobin like on i386. This means
we get hz * ncpu clocks per second and so we have to separate clock
sampling from actual 'do the work' clock processing. The BSP runs the
complete processing, the rest just sample state etc.

Using the on-cpu interval timer is not ideal as it will drift. There
is more to be done here, we should use an external clock source.


# 103436 16-Sep-2002 peter

Initiate deorbit burn for the i386-only a.out related support. Moves are
under way to move the remnants of the a.out toolchain to ports. As the
comment in src/Makefile said, this stuff is deprecated and one should not
expect this to remain beyond 4.0-REL. It has already lasted WAY beyond
that.

Notable exceptions:
gcc - I have not touched the a.out generation stuff there.
ldd/ldconfig - still have some code to interface with a.out rtld.
old as/ld/etc - I have not removed these yet, pending their move to ports.
some includes - necessary for ldd/ldconfig for now.

Tested on: i386 (extensively), alpha


# 96912 19-May-2002 marcel

o Remove namespace pollution from param.h:
- Don't include ia64_cpu.h and cpu.h
- Guard definitions by _NO_NAMESPACE_POLLUTION
- Move definition of KERNBASE to vmparam.h

o Move definitions of IA64_RR_{BASE|MASK} to vmparam.h
o Move definitions of IA64_PHYS_TO_RR{6|7} to vmparam.h

o While here, remove some left-over Alpha references.


# 96442 12-May-2002 marcel

o Rename ia64_count_aps to ia64_count_cpus and reimplement the
function to return the total number of CPUs and not the highest
CPU id.
o Define mp_maxid based on the minimum of the actual number of
CPUs in the system and MAXCPU.
o In cpu_mp_add, when the CPU id of the CPU we're trying to add
is larger than mp_maxid, don't add the CPU. Formerly this was
based on MAXCPU. Don't count CPUs when we add them. We already
know how many CPUs exist.
o Replace MAXCPU with mp_maxid when used in loops that iterate
over the id space. This avoids a couple of useless iterations.
o In cpu_mp_unleash, use the number of CPUs to determine if we
need to launch the CPUs.
o Remove mp_hardware as it's not used anymore.
o Move the IPI vector array from mp_machdep.c to sal.c. We use
the array as a centralized place to collect vector assignments.
Note that we still assign vectors to SMP specific IPIs in
non-SMP configurations. Rename the array from mp_ipi_vector to
ipi_vector.
o Add IPI_MCA_RENDEZ and IPI_MCA_CMCV. These are used by MCA.
Note that IPI_MCA_CMCV is not SMP specific.
o Initialize the ipi_vector array so that we place the IPIs in
sensible priority classes. The classes are relative to where
the AP wake-up vector is located to guarantee that it's the
highest priority (external) interrupt. Class assignment is
as follows:
class  IPI                    notes
  x    AP wake-up             (normally x=15)
 x-1   MCA rendezvous
 x-2   AST, Rendezvous, stop
 x-3   CMCV, test


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 92268 14-Mar-2002 dfr

* Add some KTR messages for IPIs.
* Don't call ast() from interrupt() - if we switch, then we will miss
writing cr.eoi which will prevent the current cpu from receiving
interrupts until the current thread is resumed. The call to ast()
happens magically in exception_restore where it is safe.
* Add DDB 'show irq' command to examine interrupt hardware state.


# 92105 11-Mar-2002 jhb

Fix a misspelling of mine: s/optomization/optimization/.

Noticed by: bmilekic


# 91669 05-Mar-2002 marcel

Call ast() only when we're handling a user trap.


# 88903 05-Jan-2002 peter

Convert a bunch of 1 << PCPU_GET(cpuid) to PCPU_GET(cpumask).


# 88900 05-Jan-2002 jhb

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex release and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# 88687 30-Dec-2001 marcel

Draft implementation of IPI handling.


# 85684 29-Oct-2001 dfr

Make the various bits of SMP code conditional on SMP so that I can still
build non-SMP kernels.


# 85674 29-Oct-2001 marcel

o Send a test IPI from the BSP to itself at the same time APs
are woken up.
o Make IPIs synchronous by default. If we want asynchronous
IPIs, we may want to make the memory fence controllable.


# 85670 29-Oct-2001 marcel

Make the clock vector 255 instead of 240. On Lion boxes, 240 is
the AP wake-up vector. We probably want a more dynamic approach
to assigning vectors in the future...


# 84541 05-Oct-2001 dfr

Wire up most of the interrupt handling infrastructure. Not sure it works
right yet but it's enough for the ATA probe to work. The SCSI probes which
follow are broken though.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 78887 27-Jun-2001 jhb

Make this compile again. Broken since June 1.


# 76410 09-May-2001 jhb

Add needed sys/lock.h include.


# 75700 19-Apr-2001 dfr

Don't take the Giant mutex for clock interrupts.


# 74732 24-Mar-2001 jhb

Stick a prototype for handleclock() in machine/clock.h and include it
in interrupt.c to quiet a warning.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
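
A short before/after illustration of the new caller interface (a sketch
only; the lock arguments are placeholders):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static void
example(struct mtx *sleep_m, struct mtx *spin_m)
{
	/* was: mtx_enter(sleep_m, MTX_DEF); */
	mtx_lock(sleep_m);
	mtx_unlock(sleep_m);

	/* was: mtx_enter(spin_m, MTX_SPIN); */
	mtx_lock_spin(spin_m);
	mtx_unlock_spin(spin_m);

	/* flags now go through the explicit wrappers */
	mtx_lock_flags(sleep_m, MTX_QUIET);
	mtx_unlock_flags(sleep_m, MTX_QUIET);
}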

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 71337 21-Jan-2001 jake

Make intr_nesting_level per-process, rather than per-cpu. Setup
interrupt threads to run with it always >= 1, so that malloc can
detect M_WAITOK from "interrupt" context. This is also necessary
in order to context switch from sched_ithd() directly.

Reviewed By: peter


# 70861 10-Jan-2001 jake

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.


# 67522 24-Oct-2000 dfr

* Various fixes to breakage introduced by the atomic and mutex reorgs.
* Fixes to the signal delivery code. Not quite right yet.

I would have preferred to wait until I have signal delivery actually
working but the current kernel in CVS doesn't build.


# 67195 16-Oct-2000 dfr

Remember to re-initialise cr.itm on clock interrupts so that we get more
than just one tick.


# 67032 12-Oct-2000 dfr

Implement a rudimentary interrupt handling system which should be good
enough for clock interrupts in SKI.


# 66458 29-Sep-2000 dfr

This is the first snapshot of the FreeBSD/ia64 kernel. This kernel will
not work on any real hardware (or fully work on any simulator). Much more
needs to happen before this is actually functional but its nice to see
the FreeBSD copyright message appear in the ia64 simulator.