# 1.1 | 07-Jul-2023 | riastradh |
heartbeat(9): New mechanism to check progress of kernel.
This uses hard interrupts to check progress of low-priority soft interrupts, and one CPU to check progress of another CPU.
If no progress has been made after a configurable number of seconds (kern.heartbeat.max_period, default 15), then the system panics, preferably on the CPU that is stuck, so that we get a stack trace in dmesg of where it was stuck. If the stuckness was detected by another CPU and the stuck CPU doesn't acknowledge the request to panic within one second, the detecting CPU panics instead.
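The progress check described above can be sketched roughly as follows. This is a user-space illustration only, not the heartbeat(9) code: the struct, its fields, and the function names are hypothetical, and the threshold is given directly in ticks rather than derived from kern.heartbeat.max_period.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; these are not the real struct cpu_info fields. */
struct sketch_cpu {
	uint64_t progress;	/* bumped whenever the low-priority soft interrupt runs */
	uint64_t last_seen;	/* value of progress observed at the previous check */
	unsigned stalled;	/* consecutive ticks with no observed progress */
};

/* Simulated soft interrupt: records that low-priority work got to run. */
static void
sketch_softint(struct sketch_cpu *c)
{
	c->progress++;
}

/*
 * Simulated hard-interrupt check, called once per tick.  Returns true
 * once the CPU has gone max_ticks consecutive ticks without progress,
 * which is the point where a real implementation would panic.
 */
static bool
sketch_check(struct sketch_cpu *c, unsigned max_ticks)
{
	if (c->progress != c->last_seen) {
		/* Progress was made since the last check; reset the count. */
		c->last_seen = c->progress;
		c->stalled = 0;
		return false;
	}
	return ++c->stalled >= max_ticks;
}
```

The same comparison could be applied by one CPU to another CPU's counters; the one-second panic acknowledgement handshake is left out of this sketch.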
This doesn't supplant hardware watchdog timers. It is possible for hard interrupts to be stuck on all CPUs for some reason too; in that case heartbeat(9) has no opportunity to complete.
Downside: heartbeat(9) relies on hardclock to run at a reasonably consistent rate, which might cause trouble for the glorious tickless future. However, it could be adapted to take a parameter for an approximate number of units that have elapsed since the last call on the current CPU, rather than treating that as a constant 1.
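One way to picture the adaptation suggested above is a check that adds a caller-supplied count of elapsed units instead of a constant 1. This is a sketch under assumptions, not a proposed patch: the names, the struct, and the choice of unsigned units are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state for a tickless-friendly check. */
struct tickless_cpu {
	uint64_t progress;	/* bumped whenever low-priority work runs */
	uint64_t last_seen;	/* value of progress observed at the previous check */
	unsigned stalled;	/* units elapsed with no observed progress */
};

/*
 * elapsed is the approximate number of units since the last call on
 * this CPU; a fixed-rate clock would always pass 1.  Returns true once
 * at least max_units have accumulated without progress.
 */
static bool
tickless_check(struct tickless_cpu *c, unsigned elapsed, unsigned max_units)
{
	if (c->progress != c->last_seen) {
		/* Progress was made since the last check; reset the count. */
		c->last_seen = c->progress;
		c->stalled = 0;
		return false;
	}
	c->stalled += elapsed;
	return c->stalled >= max_units;
}
```

With elapsed fixed at 1, this reduces to counting calls, which matches the constant-rate assumption the commit message describes.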
XXX kernel revbump -- changes struct cpu_info layout
|