History log of /freebsd-11-stable/sys/kern/subr_witness.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 332734 18-Apr-2018 shurd

MFC r332388:

Make BPF global lock an SX

This allows NIC drivers to sleep on polling config operations.

PR: 323477
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14982


# 330896 14-Mar-2018 eadler

MFC r304804:

Less-quick fix for locking fixes in r172250. r172250 added a second
syscons spinlock for the output routine alone. It is better to extend
the coverage of the first syscons spinlock added in r162285. 2 locks
might work with complicated juggling, but no juggling was done. What
the 2 locks actually did was to cover some of the missing locking in
each other and deadlock less often against each other than a single
lock with larger coverage would against itself. Races are preferable
to deadlocks here, but 2 locks are still worse since they are harder
to understand and fix.

Prefer deadlocks to races and merge the second lock into the first one.

Extend the scope of the spinlocking to all of sc_cnputc() instead of
just the sc_puts() part. This further prefers deadlocks to races.

Extend the kdb_active hack from sc_puts() internals for the second lock
to all spinlocking. This reduces deadlocks much more than the other
changes increases them. The s/p,10* test in ddb gets much further now.
Hide this detail in the SC_VIDEO_LOCK() macro. Add namespace pollution
in 1 nested #include and reduce namespace pollution in other nested
#includes to pay for this.

Move the first lock higher in the witness order. The second lock was
unnaturally low and the first lock was unnaturally high. The second
lock had to be above "sleepq chain" and/or "callout" to avoid spurious
LORs for visual bells in sc_puts(). Other console driver locks are
already even higher (but not adjacent like they should be) except when
they are missing from the table. Audio bells also benefit from the
syscons lock being high so that audio mutexes have chance of being
lower. Otherwise, console drviver locks should be as low as possible.
Non-spurious LORs now occur if the bell code calls printf() or is
interrupted (perhaps by an NMI) and the interrupt handler calls
printf(). Previous commits turned off many bells in console i/o but
missed ones done by the teken layer.


# 322273 08-Aug-2017 markj

MFC r321884, r321896:
Fix a witness assertion that fires when a lock type's class changes.


# 314063 22-Feb-2017 markj

MFC r313261:
Make witness_warn() always print to the console.


# 310959 31-Dec-2016 mjg

MFC r305378,r305379,r305386,r305684,r306224,r306608,r306803,r307650,r307685,
r308407,r308665,r308667,r309067:

cache: put all negative entry management code into dedicated functions

==
cache: manage negative entry list with a dedicated lock

Since negative entries are managed with a LRU list, a hit requires a
modificaton.

Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.

Provide a dedicated lock for use when the cache is only shared-locked.

==

cache: defer freeing entries until after the global lock is dropped

This also defers vdrop for held vnodes.

==

cache: improve scalability by introducing bucket locks

An array of bucket locks is added.

All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.

This is an intermediate step towards removal of the global lock.

==

cache: get rid of the global lock

Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.

Lookups still require the relevant bucket to be locked.

==

cache: ignore purgevfs requests for filesystems with few vnodes

purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.

In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.

Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.

The default limit is the number of bucket locks.

== (by kib)

Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

==

cache: split negative entry LRU into multiple lists

This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.

Create N dedicated lists for new negative entries.

Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.

Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.

==

cache: fix up a corner case in r307650

If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.

Fix the problem by initializing to NULL on entry.

== (by kib)

vn_fullpath1() checked VV_ROOT and then unreferenced
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.

==

cache: fix a race between entry removal and demotion

The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.

==

cache: plug a write-only variable in cache_negative_zap_one

==

cache: ensure that the number of bucket locks does not exceed hash size

The size can be changed by side effect of modifying kern.maxvnodes.

Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.

Force the number of buckets to be not smaller.

Note this should not matter for real world cases.


# 308350 05-Nov-2016 markj

MFC r308097:
Fix WITNESS hints for pagequeue locks.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 298819 29-Apr-2016 pfg

sys/kern: spelling fixes in comments.

No functional change.


# 298411 21-Apr-2016 pfg

Remove slightly used const values that can be replaced with nitems().

Suggested by: jhb


# 298310 19-Apr-2016 pfg

kernel: use our nitems() macro when it is available through param.h.

No functional change, only trivial cases are done in this sweep,

Discussed in: freebsd-current


# 291217 23-Nov-2015 markj

The buffer passed to an sbuf drain callback is not necessarily
null-terminated, so don't assume that it is.

Reported by: pho
X-MFC-With: r291059


# 291060 19-Nov-2015 markj

Remove a commented-out debug print.

MFC after: 1 week


# 291059 19-Nov-2015 markj

Add support for a configurable output channel to witness(4).

This is useful in environments where system configuration is performed by
automated interaction with the system console, since unexpected witness
output makes such automation difficult. With this change, the new
debug.witness.output_channel sysctl allows one to specify that witness
output is to be printed to the kernel log (using log(9)) rather than the
console.

Reviewed by: cem, jhb
MFC after: 2 weeks
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4183


# 285839 24-Jul-2015 marius

o Revert the other functional half of r239864, i. e. the merge of r134227
from x86 to use smp_ipi_mtx spin lock not only for smp_rendezvous_cpus()
but also for the MD cache invalidation, TLB demapping and remote register
reading IPIs due to the following reasons:
- The cross-IPI SMP deadlock x86 otherwise is subject to can't happen on
sparc64. That's because on sparc64, spin locks don't disable interrupts
completely but only raise the processor interrupt level to PIL_TICK. This
means that IPIs still get delivered and direct dispatch IPIs such as the
cache invalidation etc. IPIs in question are still executed.
- In smp_rendezvous_cpus(), smp_ipi_mtx is held not only while sending an
IPI_RENDEZVOUS, but until all CPUs have processed smp_rendezvous_action().
Consequently, smp_ipi_mtx may be locked for an extended amount of time as
queued IPIs (as opposed to the direct ones) such as IPI_RENDEZVOUS are
scheduled via a soft interrupt. Moreover, given that this soft interrupt
is only delivered at PIL_RENDEZVOUS, processing of smp_rendezvous_action()
on a target may be interrupted by f. e. a tick interrupt at PIL_TICK, in
turn leading to the target in question trying to send an IPI by itself
while IPI_RENDEZVOUS isn't fully handled, yet, and, thus, resulting in a
deadlock.
o As mentioned in the commit message of r245850, on least some sun4u platforms
concurrent sending of IPIs by different CPUs is fatal. Therefore, hold the
reintroduced MD ipi_mtx also while delivering cross-traps via MI helpers,
i. e. ipi_{all_but_self,cpu,selected}().
o Akin to x86, let the last CPU to process cpu_mp_bootstrap() set smp_started
instead of the BSP in cpu_mp_unleash(). This ensures that all APs actually
are started, when smp_started is no longer 0.
o In all MD and MI IPI helpers, check for smp_started == 1 rather than for
smp_cpus > 1 or nothing at all. This avoids races during boot causing IPIs
trying to be delivered to APs that in fact aren't up and running, yet.
While at it, move setting of the cpu_ipi_{selected,single}() pointers to
the appropriate delivery functions from mp_init() to cpu_mp_start() where
it's better suited and allows to get rid of the global isjbus variable.
o Given that now concurrent IPI delivery no longer is possible, also nuke
the delays before completely disabling interrupts again in the CPU-specific
cross-trap delivery functions, previously giving other CPUs a window for
sending IPIs on their part. Actually, we now should be able to entirely get
rid of completely disabling interrupts in these functions. Such a change
needs more testing, though.
o In {s,}tick_get_timecount_mp(), make the {s,}tick variable static. While not
necessary for correctness, this avoids page faults when accessing the stack
of a foreign CPU as {s,}tick now is locked into the TLBs as part of static
kernel data. Hence, {s,}tick_get_timecount_mp() always execute as fast as
possible, avoiding jitter.

PR: 201245
MFC after: 3 days


# 285252 07-Jul-2015 markj

Fix an incorrect assertion in witness.

The number of available lock list entries for a thread is LOCK_CHILDCOUNT,
and each entry can record up to LOCK_NCHILDREN locks. When iterating over
the locks held by a thread, a bound on the loop index is therefore given
by LOCK_CHILDCOUNT * LOCK_NCHILDREN; WITNESS_COUNT is an unrelated
constant.

Reviewed by: jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D2974


# 284127 07-Jun-2015 markj

witness: don't warn about matrix inconsistencies without holding the mutex

Lock order checking is done without the witness mutex held, so multiple
threads that are racing to establish a new lock order may read matrix
entries that are in an inconsistent state. Don't print a warning in this
case, but instead just redo the check after taking the witness lock.

Differential Revision: https://reviews.freebsd.org/D2713
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division


# 283248 21-May-2015 pfg

ddb: finish converting boolean values.

The replacement started at r283088 was necessarily incomplete without
replacing boolean_t with bool. This also involved cleaning some type
mismatches and ansifying old C function declarations.

Pointed out by: bde
Discussed with: bde, ian, jhb


# 279390 28-Feb-2015 kib

The umtx_lock mutex is used by top-half of the kernel, but is
currently a spin lock. Apparently, the only reason for this is that
umtx_thread_exit() is called under the process spinlock, which put the
requirement on the umtx_lock. Note that the witness static order list
is wrong for the umtx_lock, umtx_lock is explicitely before any thread
lock, so it is also before sleepq locks.

Change umtx_lock to be the sleepable mutex. For the reason above, the
calls to umtx_thread_exit() are moved from thread_exit() earlier in
each caller, when the process spin lock is not yet taken.

Discussed with: jhb
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 273342 20-Oct-2014 markj

Fix a typo from r189544, which replaced unp_global_rwlock with unp_list_lock
and unp_link_rwlock.

MFC after: 3 days


# 272944 11-Oct-2014 marcel

Fix nits in previous commit:
1. Remove initializer for badstack_sbuf_size; it gets set unconditionally.
2. Remove meaningless comment.
3. Group witness_count and its sysctl together.
4. Fix spacing in for statements (space after for and within condition).
5. Change *all* M_NOWAIT usages in witness_initialize() to M_WAITOK; not
just those that were newly introduced -- the allocation is assumed to
succeed for all allocations.
6. Avoid using uint8_t as the base type in sizeof() expressions; Use the
variable name (w_rmatrix) as much as possible.

Pointed out by: jhb@ (thanks!)


# 272925 11-Oct-2014 marcel

Turn WITNESS_COUNT into a tunable and sysctl. This allows adjusting
the value without recompiling the kernel. This is useful when
recompiling is not possible as an immediate solution. When we run out
of witness objects, witness is completely disabled. Not having an
immediate solution can therefore be problematic.

Submitted by: Sreekanth Rupavatharam <rupavath@juniper.net>
Obtained from: Juniper Networks, Inc.


# 269459 03-Aug-2014 imp

Make the witness lock limit an option.


# 268351 06-Jul-2014 marcel

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# 267992 28-Jun-2014 hselasky

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 267985 27-Jun-2014 gjb

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 267961 27-Jun-2014 hselasky

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 265098 29-Apr-2014 grehan

Bump WITNESS_PENDLIST by MAXCPU to account for the
pmap pvlist locks which are scaled by MAXCPU.

This allows an amd64 system to boot with MAXCPU set
to 256, which is currently FreeBSD's hard limit without
x2apic support.

Compile-tested for other arch's.

PR: 185831
Discussed with: jhb
MFC after: 3 weeks


# 263152 14-Mar-2014 glebius

Remove AppleTalk support.

AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.


# 260898 20-Jan-2014 neel

Bump up WITNESS_COUNT from 1024 to 1536 so there are sufficient entries for
WITNESS to actually work.

Reviewed by: jhb@


# 259876 25-Dec-2013 dim

In sys/kern/subr_witness.c, remove static function
witness_lock_order_key_empty(), which is unused since r181695.

MFC after: 3 days


# 255205 04-Sep-2013 jhb

Trim a couple of panic messages.


# 254195 10-Aug-2013 kib

The r254167 moved initialization of the sleepqueues before the witness
is operational. init_sleepqueues() initializes 256 mutexes, which,
due to witness still being cold, started to overflow the pending_locks
array.

As stated in the reported panic message, increase WITNESS_PENDLIST
from 768 to 1024, which provides space for additional 256 locks.

Reported by: many
Tested by: rakuco, bdrewery


# 253007 07-Jul-2013 alfred

Make kassert_printf use __printflike.

Fix associated errors/warnings while I'm here.

Requested by: avg


# 251326 03-Jun-2013 jhb

- Fix a couple of inverted panic messages for shared/exclusive mismatches
of a lock within a single thread.
- Fix handling of interlocks in WITNESS by properly requiring the interlock
to be held exactly once if it is specified.


# 250411 09-May-2013 marcel

Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.


# 248093 09-Mar-2013 kib

Correct the lock class for the vm object lock.

Reported and tested by: joel


# 244112 11-Dec-2012 alfred

Cleanup more of the kassert_panic.

fix compile warnings on !amd64 and NULL derefs that would happen
if kassert_panic() would return.


# 244111 11-Dec-2012 alfred

Fix WITNESS when INVARIANT_SUPPORT is defined.

This fixes tinderbox breakage from r244105.

Pointed out by: adrian


# 244105 10-Dec-2012 alfred

Switch the hardwired WITNESS panics to kassert_panic.

This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.

This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.

Sponsored by: iXsystems


# 239864 29-Aug-2012 marius

- Unlike cache invalidation and TLB demapping IPIs, reading registers from
other CPUs doesn't require locking so get rid of it. As the latter is used
for the timecounter on certain machine models, using a spin lock in this
case can lead to a deadlock with the upcoming callout(9) rework.
- Merge r134227/r167250 from x86:
Avoid cross-IPI SMP deadlock by using the smp_ipi_mtx spin lock not only
for smp_rendezvous_cpus() but also for the MD cache invalidation and TLB
demapping IPIs.
- Mark some unused function arguments as such.

MFC after: 1 week


# 239584 22-Aug-2012 jhb

Fix the 'show witness' DDB command to honor db_pager_quit.


# 237623 27-Jun-2012 alc

Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.


# 235745 21-May-2012 melifaro

Fix old panic when BPF consumer attaches to destroying interface.
'flags' field is added to the end of bpf_if structure. Currently the only
flag is BPFIF_FLAG_DYING which is set on bpf detach and checked by bpf_attachd()
Problem can be easily triggered on SMP stable/[89] by the following command (sort of):
'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'

Fix possible use-after-free when BPF detaches itself from interface, freeing bpf_bif memory,
while interface is still UP and there can be routes via this interface.
Freeing is now delayed till ifnet_departure_event is received via eventhandler(9) api.

Convert bpfd rwlock back to mutex due lack of performance gain (currently checking if packet
matches filter is done without holding bpfd lock and we have to acquire write lock if packet matches)

Approved by: kib(mentor)
MFC in: 4 weeks


# 233937 06-Apr-2012 melifaro

- Improve BPF locking model.

Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greately improves performance: in most common case we need to acquire 1
reader lock instead of 2 mutexes.

- Remove filter(descriptor) (reader) lock in bpf_mtap[2]
This was suggested by glebius@. We protect filter by requesting interface
writer lock on filter change.

- Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
without including rwlock stuff. However, this is is temporary solution,
struct bpf_if should be made opaque for any external caller.

Found by: Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by: Yandex LLC

Reviewed by: glebius (previous version)
Reviewed by: silence on -net@
Approved by: (mentor)

MFC after: 3 weeks


# 229873 09-Jan-2012 jhb

Convert the per-interface address list lock from a mutex to a reader/writer
lock.

Reviewed by: bz


# 228424 11-Dec-2011 avg

panic: add a switch and infrastructure for stopping other CPUs in SMP case

Historical behavior of letting other CPUs merily go on is a default for
time being. The new behavior can be switched on via
kern.stop_scheduler_on_panic tunable and sysctl.

Stopping of the CPUs has (at least) the following benefits:
- more of the system state at panic time is preserved intact
- threads and interrupts do not interfere with dumping of the system
state

Only one thread runs uninterrupted after panic if stop_scheduler_on_panic
is set. That thread might call code that is also used in normal context
and that code might use locks to prevent concurrent execution of certain
parts. Those locks might be held by the stopped threads and would never
be released. To work around this issue, it was decided that instead of
explicit checks for panic context, we would rather put those checks
inside the locking primitives.

This change has substantial portions written and re-written by attilio
and kib at various times. Other changes are heavily based on the ideas
and patches submitted by jhb and mdf. bde has provided many insights
into the details and history of the current code.

The new behavior may cause problems for systems that use a USB keyboard
for interfacing with system console. This is because of some unusual
locking patterns in the ukbd code which have to be used because on one
hand ukbd is below syscons, but on the other hand it has to interface
with other usb code that uses regular mutexes/Giant for its concurrency
protection. Dumping to USB-connected disks may also be affected.

PR: amd64/139614 (at least)
In cooperation with: attilio, jhb, kib, mdf
Discussed with: arch@, bde
Tested by: Eugene Grosbein <eugen@grosbein.net>,
gnn,
Steven Hartland <killing@multiplay.co.uk>,
glebius,
Andrew Boyer <aboyer@averesystems.com>
(various versions of the patch)
MFC after: 3 months (or never)


# 227588 16-Nov-2011 pjd

Constify arguments for locking KPIs where possible.

This enables locking consumers to pass their own structures around as const and
be able to assert locks embedded into those structures.

Reviewed by: ed, kib, jhb


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 227293 07-Nov-2011 ed

Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.

This means that their use is restricted to a single C file.


# 226793 26-Oct-2011 jhb

- Fixup filenames in a few more places where they are used.
- Some whitespace fixes.


# 226294 12-Oct-2011 adrian

Don't call fixup_filename() on each witness lock call.

This has been irking me for a while. This causes significant
CPU use on bottlenecked CPUs (eg my older EEEPC w/ an earlier
Celeron CPU and my MIPS24k boards) when they're passing
a lot of traffic.

Since the file/line values are only used for printing, this
should only affect display. It should have no operational
change on the code, besides reducing CPU use.


# 218909 21-Feb-2011 brucec

Fix typos - remove duplicate "the".

PR: bin/154928
Submitted by: Eitan Adler <lists at eitanadler.com>
MFC after: 3 days


# 217916 26-Jan-2011 mdf

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


# 212750 16-Sep-2010 mdf

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)


# 212572 13-Sep-2010 mdf

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


# 212370 09-Sep-2010 mdf

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


# 210611 29-Jul-2010 rpaulo

Bump the witness pendlist to 768 to accomodate the increased number of
spinlocks.


# 209403 21-Jun-2010 mav

"time lock" is no longer a spin-lock since r209371.

Reported by: kib@


# 207929 11-May-2010 attilio

Right now, WITNESS just blindly pipes all the output to the
(TOCONS | TOLOG) mask even when called from DDB points.
That breaks several output, where the most notable is textdump output.
Fix this by having configurable callbacks passed to witness_list_locks()
and witness_display_spinlock() for printing out datas.

Reported by: several broken textdump outputs
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC after: 7 days
X-MFC: r207922


# 207922 11-May-2010 attilio

There is not a good reason to have a different prototype for db_printf()
when compared to printf().
Unify it by returning the number of characters displayed for db_printf()
as well.

MFC after: 7 days


# 207410 29-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 201178 29-Dec-2009 trasz

SLIP is gone; remove its mutex from witness.


# 196891 06-Sep-2009 antoine

Change w_notrunning and w_stillcold from pointer to array so that sizeof
returns what is expected.

PR: kern/138557
Discussed with: brucec@
MFC after: 1 month


# 192416 20-May-2009 kmacy

Add minimal ZFS lock hierarchy


# 191672 29-Apr-2009 bms

Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:

* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.

NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.

This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.


# 189544 08-Mar-2009 rwatson

Decompose the global UNIX domain sockets rwlock into two different
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.

This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order. It may also reduce
contention on the global lock for some workloads.

Add global UNIX domain socket locks to hard-coded witness lock
order.

MFC after: 1 week
Discussed with: kris


# 189194 28-Feb-2009 thompsa

Move the NORELEASE check to after the recurse count decrement and bailout, this
is not counted as actually releasing the lock.


# 188056 03-Feb-2009 imp

o Use unsigned for bit fields.
o Use NULL for pointers in preference to 0.


# 187511 21-Jan-2009 thompsa

Add functions WITNESS so it can be asserted that the lock is not released for a
section of code, this uses WITNESS_NORELEASE() and WITNESS_RELEASEOK() to mark
the boundaries. Both functions require the lock to be held when calling.

This is intended for scenarios like a bus asserting that the bus lock is not
dropped during a driver call. There doesn't appear to be a man page to
document this in.

Reviewed by: jhb


# 185747 07-Dec-2008 kmacy

- convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
- fix LOR in rtexpunge
- fix LOR in rtredirect

Reviewed by: sam


# 184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


# 184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# 184098 20-Oct-2008 attilio

In the actual code for witness_warn:
- If there aren't spinlocks held, but there are problems with old
sleeplocks, they are not reported.
- If the spinlock found is not the only one, problems are not reported.

Fix these 2 problems.

Reported by: tegge


# 183955 16-Oct-2008 attilio

- Fix a race in witness_checkorder() where, between the PCPU_GET() and
PCPU_PTR() curthread can migrate on another CPU and get incorrect
results.
- Fix a similar race into witness_warn().
- Fix the interlock's checks bypassing by correctly using the appropriate
children even when the lock_list chunk to be explored is not the first
one.
- Allow witness_warn() to work with spinlocks too.

Bugs found by: tegge
Submitted by: jhb, tegge
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 183574 03-Oct-2008 jhb

Oops, missed updating a place with with 's/lock1/plock/' when adding
interlock support to WITNESS. Specifically, the printf listing the
first location when duplicate locks of the same type are acquired.

Reported by: pho


# 183329 24-Sep-2008 jhb

Update description of witness_watch.


# 183054 15-Sep-2008 sam

Make ddb command registration dynamic so modules can extend
the command set (only so long as the module is present):
o add db_command_register and db_command_unregister to add and remove
commands, respectively
o replace linker sets with SYSINIT's (and SYSUINIT's) that register
commands
o expose 3 list heads: db_cmd_table, db_show_table, and db_show_all_table
for registering top-level commands, show operands, and show all operands,
respectively

While here also:
o sort command lists
o add DB_ALIAS, DB_SHOW_ALIAS, and DB_SHOW_ALL_ALIAS to add aliases
for existing commands
o add "show all trace" as an alias for "show alltrace"
o add "show all locks" as an alias for "show alllocks"

Submitted by: Guillaume Ballet <gballet@gmail.com> (original version)
Reviewed by: jhb
MFC after: 1 month


# 182984 12-Sep-2008 attilio

- For any lock list we hold the head in order to reduce allocation from
the free list and in this way avoid contention on the w_mtx.
In order to make the code simple, we rely on the rule that when the head
has not a child it also doesn't have other subsequent entries.
Actually this assertion is broken because we can free all the head
children and quit witness_unlock() with the head still allocated, with no
children and subsequent entries present.
Fix this by shifting the head if other entries are present and still
freeing the object, but leaving always an head.
- Fix witness_thread_has_locks() in order to report, correctly, if the
lock list linked to a specific thread has children or not based on the
above explained rule.
- Fix a printout into DDB's "show alllocks" command in order to show,
correctly, the process name that is really what we want.
- Fix style(9) for a comment.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reported by: Marko Kiiskila <marko dot kiiskila at nokia dot com>
Sponsored by: Nokia


# 182914 10-Sep-2008 jhb

Teach WITNESS about the interlocks used with lockmgr. This removes a bunch
of spurious witness warnings since lockmgr grew witness support. Before
this, every time you passed an interlock to a lockmgr lock WITNESS treated
it as a LOR.

Reviewed by: attilio


# 182473 30-Aug-2008 attilio

- Improve some witness_watch operability in code which does perform both
lock tracking and checks, doing just the former ones.
- Fix a bug where sysctl utility was printing crazy values when setting a
new value for debug.witness.watch [0]

[0] Reported by: yongari


# 182446 29-Aug-2008 attilio

- Make witness_watch a 3 state value.
1 means that witness is up and running.
0 means that witness is disabled but that it can be established later
again in effective way.
-1 means that witness is disabled permanently
- Fix a bug causing kernel to panic on witness disabling through
witness_watch. lock lists queues were still full of entries and this was
causing throubles with debugging stubs (like witness_thread_exit()).

Reported by: kris, yongari
Sponsored by: Nokia


# 181695 13-Aug-2008 attilio

Introduce some WITNESS improvements:
- Speedup the lock orderings lookup modifying the witness graph from a
linked tree to a matrix. A table lookup caches the lock orderings in
order to make a O(1) access for them. Any witness object has an unique
index withing this lookup cache table.
- Reduce the lock contention on w_mtx acquiring it only when the LOR
actually happens and not in a sane case. In order to do this don't totally
flush lock lists (per-CPU spinlocks list and per-thread sleeplocks list)
but check for ll_count anytime we need to have to verify allocations sanity.
- Introduce the function witness_thread_exit() in the witness namespace which
should verify a thread doesn't hold any witness occurrence why exiting.
- Rename the sysctl debug.witness.graphs into debug.witness.fullgraph and
add debug.witness.badstacks which prints out stacks for LOR revealed.
This is implemented using the stack(9) support, which makes WITNESS to be
dependent by the STACK option or by the DDB (including STACK) option.
- Fix style(9) for src/sys/kern/subr_witness.c

The hash table approach has been developed by Ilya Maykov on the behalf of
Isilon Systems which kindly released the patch.
Jeff Roberson, ported the patch to -CURRENT and fixed w_mtx contention, on the
behalf of Nokia.

Submitted by: Ilya Maykov <ivmaykov at gmail dot com> (Isilon Systems), jeff
Sponsored by: Nokia


# 180613 19-Jul-2008 rwatson

witness_addgraph() is required even if DDB isn't compiled into the kernel,
so exclude it from #ifdef DDB.

Submitted by: attilio


# 179025 15-May-2008 attilio

- Embed the recursion counter for any locking primitive directly in the
lock_object, using an unified field called lo_data.
- Replace lo_type usage with the w_name usage and at init time pass the
lock "type" directly to witness_init() from the parent lock init
function. Handle delayed initialization before than
witness_initialize() is called through the witness_pendhelp structure.
- Axe out LO_ENROLLPEND as it is not really needed. The case where the
mutex init delayed wants to be destroyed can't happen because
witness_destroy() checks for witness_cold and panic in case.
- In enroll(), if we cannot allocate a new object from the freelist,
notify that to userspace through a printf().
- Modify the depart function in order to return nothing as in the current
CVS version it always returns true and adjust callers accordingly.
- Fix the witness_addgraph() argument name prototype.
- Remove unuseful code from itismychild().

This commit leads to a shrinked struct lock_object and so smaller locks,
in particular on amd64 where 2 uintptr_t (16 bytes per-primitive) are
gained.

Reviewed by: jhb


# 178841 07-May-2008 attilio

Add a new witness sysctl which returns the relations between any lock
and its children in the form:
"parent","child"
so that head and bottom of an oriented graph can be easilly detected and
various form of diagrams can be build.
The sysctl is called debug.witness.graphs and it is read-only; in order
to get the list of relations, a simple:
#sysctl debug.witness.graphs
will do the trick.

This approach has been choosen in order to support easilly things like
the DOT format and such. Soon, an auto-explicative awk script, which
filters simple informations returned by the sysctl and converts them into
a real DOT script, will be committed to the repository between examples.

Discussed with: rwatson


# 178285 17-Apr-2008 rwatson

Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committered patch)


# 178165 12-Apr-2008 attilio

struct lock_instance and struct lock_list_entry don't need to be in the
public namespace for WITNESS as they are only used internally so just
move them in the private namespace for the subsystem (with all related
supporting definitions).


# 178159 12-Apr-2008 attilio

- Re-introduce WITNESS support for lockmgr. About the old implementation
the only one difference is that lockmgr*() functions now accept
LK_NOWITNESS flag which skips ordering for the instanced calling.
- Remove an unuseful stub in witness_checkorder() (because the above check
doesn't allow ever happening) and allow witness_upgrade() to accept
non-try operation too.


# 178149 12-Apr-2008 attilio

Add missing stubs for spinlocks cpuset and intrcnt.

Submitted by: kris


# 177299 17-Mar-2008 pjd

- There is no more "uidinfo struct" mutex.
- The "uidinfo hash" lock is now a rwlock.

Reminded by: kib


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 176771 03-Mar-2008 raj

Initial support for Freescale PowerQUICC III MPC85xx system-on-chip family.

The PQ3 is a high performance integrated communications processing system
based on the e500 core, which is an embedded RISC processor that implements
the 32-bit Book E definition of the PowerPC architecture. For details refer
to: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8555E

This port was tested and successfully run on the following members of the PQ3
family: MPC8533, MPC8541, MPC8548, MPC8555.

The following major integrated peripherals are supported:

* On-chip peripherals bus
* OpenPIC interrupt controller
* UART
* Ethernet (TSEC)
* Host/PCI bridge
* QUICC engine (SCC functionality)

This commit brings the main functionality and will be followed by individual
drivers that are logically separate from this base.

Approved by: cognet (mentor)
Obtained from: Juniper, Semihalf
MFp4: e500


# 174898 25-Dec-2007 rwatson

Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument. This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.


# 173877 24-Nov-2007 attilio

Fix the spinlock static table adding missing spinlocks.
- rm_spinlock has turnstile chain as child
- srclock has callout and clk as child, found by witness "emulation".
Just move it very high in our ranking


# 173601 14-Nov-2007 julian

A bunch more files that should probably print out a thread name
instead of a process name.


# 172256 20-Sep-2007 attilio

Fix some entries in the locks static table of witness.
In particular:
- smp_tlb_mtx is no longer used, so it is axed.
- smp rendezvous lock isn't really a leaf spin-mutex. Its bad placement in
the table, however, has been the source of a false positive LOR reporting
with the dt_lock. However, smp rendezvous lock would have had sched_lock
there for older lock, so it wasn't still a leaf lock.
- allpmaps is only used in ia32 architecture, so it is inserted in the
appropriate stub.

Addictionally:
- kse_zombie_lock is no longer present, so its definition is axed out.
- zombie_lock doesn't need to have an exported symbol, so just let's it be
declared as static.

Tested by: kris
Approved by: jeff (mentor)
Approved by: re


# 170848 16-Jun-2007 marius

- Remove zstty spin lock for no longer existing zs(4).
- Move the rtc_mtx spin lock out from under #ifdef SMP as it's just
not SMP-specific.
- Add a new spin lock pcib_mtx for locking "fast" interrupt handlers
of host-to-PCI bridge drivers on sparc64.


# 170530 11-Jun-2007 sam

Update 802.11 wireless support:
o major overhaul of the way channels are handled: channels are now
fully enumerated and uniquely identify the operating characteristics;
these changes are visible to user applications which require changes
o make scanning support independent of the state machine to enable
background scanning and roaming
o move scanning support into loadable modules based on the operating
mode to enable different policies and reduce the memory footprint
on systems w/ constrained resources
o add background scanning in station mode (no support for adhoc/ibss
mode yet)
o significantly speedup sta mode scanning with a variety of techniques
o add roaming support when background scanning is supported; for now
we use a simple algorithm to trigger a roam: we threshold the rssi
and tx rate, if either drops too low we try to roam to a new ap
o add tx fragmentation support
o add first cut at 802.11n support: this code works with forthcoming
drivers but is incomplete; it's included now to establish a baseline
for other drivers to be developed and for user applications
o adjust max_linkhdr et. al. to reflect 802.11 requirements; this eliminates
prepending mbufs for traffic generated locally
o add support for Atheros protocol extensions; mainly the fast frames
encapsulation (note this can be used with any card that can tx+rx
large frames correctly)
o add sta support for ap's that beacon both WPA1+2 support
o change all data types from bsd-style to posix-style
o propagate noise floor data from drivers to net80211 and on to user apps
o correct various issues in the sta mode state machine related to handling
authentication and association failures
o enable the addition of sta mode power save support for drivers that need
net80211 support (not in this commit)
o remove old WI compatibility ioctls (wicontrol is officially dead)
o change the data structures returned for get sta info and get scan
results so future additions will not break user apps
o fixed tx rate is now maintained internally as an ieee rate and not an
index into the rate set; this needs to be extended to deal with
multi-mode operation
o add extended channel specifications to radiotap to enable 11n sniffing

Drivers:
o ath: add support for bg scanning, tx fragmentation, fast frames,
dynamic turbo (lightly tested), 11n (sniffing only and needs
new hal)
o awi: compile tested only
o ndis: lightly tested
o ipw: lightly tested
o iwi: add support for bg scanning (well tested but may have some
rough edges)
o ral, ural, rum: add suppoort for bg scanning, calibrate rssi data
o wi: lightly tested

This work is based on contributions by Atheros, kmacy, sephe, thompsa,
mlaier, kevlo, and others. Much of the scanning work was supported by
Atheros. The 11n work was supported by Marvell.


# 170302 04-Jun-2007 jeff

Commit 10/14 of sched_lock decomposition.
- Add new spinlocks to support thread_lock() and adjust ordering.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170110 29-May-2007 attilio

Fix some problems introduced with the last descriptors tables locking
patch:
- Do the correct test for ldt allocation
- Drop dt_lock just before to call kmem_free (since it acquires blocking
locks inside)
- Solve a deadlock with smp_rendezvous() where other CPU will wait
undefinitively for dt_lock acquisition.
- Add dt_lock in the WITNESS list of spinlocks

While applying these modifies, change the requirement for user_ldt_free()
making that returning without dt_lock held.

Tested by: marcus, tegge
Reviewed by: tegge
Approved by: jeff (mentor)


# 169803 20-May-2007 jeff

- Move clock synchronization into a seperate clock lock so the global
scheduler lock is not involved. sched_lock still protects the sched_clock
call. Another patch will remedy this.

Contributed by: Attilio Rao <attilio@FreeBSD.org>
Tested by: kris, jeff


# 168856 19-Apr-2007 jkoshy

Fix witness(4) warnings about mutex use.

Group mutexes used in hwpmc(4) into 3 "types" in the sense of
witness(4):

- leaf spin mutexes---only one of these should be held at a time,
so these mutexes are specified as belonging to a single witness
type "pmc-leaf".

- `struct pmc_owner' descriptors are protected by a spin mutex of
witness type "pmc-owner-proc". Since we call wakeup_one() while
holding these mutexes, the witness type of these mutexes needs
to dominate that of "sleepq chain" mutexes.

- logger threads use a sleep mutex, of type "pmc-sleep".

Submitted by: wkoszek (earlier patch)


# 168402 05-Apr-2007 pjd

allprison mutex was converted to sx(9) lock.


# 168355 04-Apr-2007 rwatson

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 168217 01-Apr-2007 wkoszek

ng_node and ng_worklist locks both migrated from being spinning locks to
adaptive mutexes. Let witness(4) calm down and bring proper types of those
locks to the lock order database.

Glanced at by: rwatson


# 167787 21-Mar-2007 jhb

Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes,
rwlocks, and sx locks to 'lock_object'.


# 166857 20-Feb-2007 rwatson

Remove unnecessary privilege and privilege check for WITNESS sysctl.

Head nod: jhb


# 166543 07-Feb-2007 alc

Remove the vm page queue free mutex from the CDEV order.


# 166525 06-Feb-2007 mpp

The change to the vm_page_queue_freelist lock from a spin lock to a
sleep lock missed the witness code, and the system will panic
immediately on boot if WITNESS is enabled.

Changed the witness definition to the new type.


# 166421 02-Feb-2007 kib

Record kqueue -> struct mount mtx -> vnode interlock lock order to
catch the places where reverse lock order is instantiated.

OKed by: jeff


# 166061 16-Jan-2007 ssouhlal

Remove hptlock from the static witness table, now that it's a regular sleep
mutex.


# 164159 11-Nov-2006 kmacy

MUTEX_PROFILING has been generalized to LOCK_PROFILING. We now profile
wait (time waited to acquire) and hold times for *all* kernel locks. If
the architecture has a system synchronized TSC, the profiling code will
use that - thereby minimizing profiling overhead. Large chunks of profiling
code have been moved out of line, the overhead measured on the T1 for when
it is compiled in but not enabled is < 1%.

Approved by: scottl (standing in for mentor rwatson)
Reviewed by: des and jhb


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 162285 13-Sep-2006 scottl

Introduce a spinlock for synchronizing access to the video output hardware
in syscons. This replaces a simple access semaphore that was assumed to be
protected by Giant but often was not. If two threads that were otherwise
SMP-safe called printf at the same time, there was a high likelyhood that
the semaphore would get corrupted and result in a permanently frozen video
console. This is similar to what is already done in the serial console
drivers.


# 161638 26-Aug-2006 ssouhlal

The "taskqueue_fast" spinlocks were renamed to "fast_taskqueue" in
subr_taskqueue.c:r1.32

Reported by: rdivacky


# 158030 25-Apr-2006 jhb

Use db_lookup_thread() to lookup the thread for the passed in address
and change 'show locks' to only list the locks for a given thread
rather than for all the threads in the process containing a specified
thread.


# 158026 25-Apr-2006 marius

Remove last vestiges of sab(4).


# 157584 07-Apr-2006 marcel

Add the scc_hwmtx spin mutex, defined by scc(4).


# 154818 25-Jan-2006 jhb

Axe KTR_ALQ_MASK now that KTR_WITNESS is off unless you hack an #ifdef
in subr_witness.c. I did add a comment in subr_witness.c noting that
KTR_WITNESS is incompatible with KTR_ALQ.


# 154790 24-Jan-2006 jhb

- Add a new KTR_SUBSYS in place of KTR_SPARE1 to serve as a subsystem
placeholder similar to KTR_DEV. Explain the use of KTR_DEV and
KTR_SUBSYS in a comment as well.
- Retire KTR_WITNESS and instead have KTR_WITNESS default to off but use
KTR_SUBSYS if it is enabled.


# 154484 17-Jan-2006 jhb

Add a new file (kern/subr_lock.c) for holding code related to struct
lock_obj objects:
- Add new lock_init() and lock_destroy() functions to setup and teardown
lock_object objects including KTR logging and registering with WITNESS.
- Move all the handling of LO_INITIALIZED out of witness and the various
lock init functions into lock_init() and lock_destroy().
- Remove the constants for static indices into the lock_classes[] array
and change the code outside of subr_lock.c to use LOCK_CLASS to compare
against a known lock class.
- Move the 'show lock' ddb function and lock_classes[] array out of
kern_mutex.c over to subr_lock.c.


# 154077 06-Jan-2006 jhb

Trim another pointer from struct lock_object (and thus from struct mtx and
struct sx). Instead of storing a direct pointer to a our lock_class
struct in lock_object, reserve 4 bits in the lo_flags field to serve as an
index into a global lock_classes array that contains pointers to the lock
classes. Only debugging code such as WITNESS or INVARIANTS checks and KTR
logging need to access the lock_class member, so this shouldn't add any
overhead to production kernels. It might add some slight overhead to
kernels using those debug options however.

As with the previous set of changes to lock_object, this is going to
completely obliterate the kernel ABI, so be sure to recompile all your
modules.


# 153854 29-Dec-2005 jhb

Teach WITNESS_SAVE() and WITNESS_RESTORE() to work with spin locks instead
of only sleep locks.


# 153853 29-Dec-2005 jhb

Fix a deadlock I introduced with the recently added printf to warn about
spin locks that are not in the static order list. It is not safe to call
printf while holding the witness spin mutex since the console drivers that
back printf may need to use their own spin locks which would try to talk
to witness when they were locked. Given this, it is possible for one
CPU to lock a console driver lock (such as sio) which then tries to lock
the witness lock while another CPU is doing the printf while holding the
witness lock. Fix this by moving the printf outside of the witness lock.
All other printf's in witness are already correct.

MFC after: 3 days


# 153133 05-Dec-2005 jhb

Tweak witness handling of lock object to shave 2 pointers off of each
lock object (and thus off of each mutex and sx lock):
- Rename the all_locks list to pending_locks and only put locks initialized
before SI_SUB_WITNESS on the list so that the SI_SUB_WITNESS can add them
to witness once it starts up.
- Now that pending_locks is only used during early startup, change it from
a TAILQ to an STAILQ. This removes a pointer from the STAILQ_ENTRY in
struct lock_object.
- Since the pending_locks list is only used during the single-threaded
early boot it no longer needs to be protected by a mutex, so remove
all_mtx.
- Since the lo_list member of struct lock_object is now only used during
early boot before witness is running, collapse lo_list and lo_witness
into a union. This shaves the second pointer off of struct lock_object.
- Axe lock_cur_cnt and lock_max_cnt.

With these changes, struct mtx shrinks from 36 to 28 bytes on 32-bit
platforms and from 72 to 56 bytes on 64-bit platforms. Note that this
commit will completely and utterly destroy the kernel ABI, so no MFC.

Tested on: alpha, amd64, i386, sparc64


# 151658 25-Oct-2005 jhb

Reorganize the interrupt handling code a bit to make a few things cleaner
and increase flexibility to allow various different approaches to be tried
in the future.
- Split struct ithd up into two pieces. struct intr_event holds the list
of interrupt handlers associated with interrupt sources.
struct intr_thread contains the data relative to an interrupt thread.
Currently we still provide a 1:1 relationship of events to threads
with the exception that events only have an associated thread if there
is at least one threaded interrupt handler attached to the event. This
means that on x86 we no longer have 4 bazillion interrupt threads with
no handlers. It also means that interrupt events with only INTR_FAST
handlers no longer have an associated thread either.
- Renamed struct intrhand to struct intr_handler to follow the struct
intr_foo naming convention. This did require renaming the powerpc
MD struct intr_handler to struct ppc_intr_handler.
- INTR_FAST no longer implies INTR_EXCL on all architectures except for
powerpc. This means that multiple INTR_FAST handlers can attach to the
same interrupt and that INTR_FAST and non-INTR_FAST handlers can attach
to the same interrupt. Sharing INTR_FAST handlers may not always be
desirable, but having sio(4) and uhci(4) fight over an IRQ isn't fun
either. Drivers can always still use INTR_EXCL to ask for an interrupt
exclusively. The way this sharing works is that when an interrupt
comes in, all the INTR_FAST handlers are executed first, and if any
threaded handlers exist, the interrupt thread is scheduled afterwards.
This type of layout also makes it possible to investigate using interrupt
filters ala OS X where the filter determines whether or not its companion
threaded handler should run.
- Aside from the INTR_FAST changes above, the impact on MD interrupt code
is mostly just 's/ithread/intr_event/'.
- A new MI ddb command 'show intrs' walks the list of interrupt events
dumping their state. It also has a '/v' verbose switch which dumps
info about all of the handlers attached to each event.
- We currently don't destroy an interrupt thread when the last threaded
handler is removed because it would suck for things like ppbus(8)'s
braindead behavior. The code is present, though, it is just under
#if 0 for now.
- Move the code to actually execute the threaded handlers for an interrrupt
event into a separate function so that ithread_loop() becomes more
readable. Previously this code was all in the middle of ithread_loop()
and indented halfway across the screen.
- Made struct intr_thread private to kern_intr.c and replaced td_ithd
with a thread private flag TDP_ITHREAD.
- In statclock, check curthread against idlethread directly rather than
curthread's proc against idlethread's proc. (Not really related to intr
changes)

Tested on: alpha, amd64, i386, sparc64
Tested on: arm, ia64 (older version of patch by cognet and marcel)


# 151629 24-Oct-2005 jhb

Don't panic if a spin lock is initialized that isn't in our static order
list. Just warn about it instead.

Requested by: scottl
MFC after: 1 day


# 151623 24-Oct-2005 jhb

Spell hierarchy correctly in comments.

Submitted by: Wojciech A. Koszek dunstan at freebsd dot czest dot pl


# 151512 20-Oct-2005 jhb

Add entry for the spin mutex used by the hptmv(4) driver.

MFC after: 1 day
Tested by: Philip Kizer pckizer at nostrum dot com


# 150582 26-Sep-2005 jhb

Add the spin lock used by the binary nvidia driver to the static lock
order list so that WITNESS and the driver play together nicely.

Tested by: Harald Schmalzbauer
MFC after: 3 days


# 150179 15-Sep-2005 jhb

- Enforce an implicit lock order that Giant cannot be locked while holding
any other non-sleepable lock. In plain English: Giant comes before all
other mutexes.
- Add some extra description to the lock order reversal printf's to indicate
when a reversal is triggered by a hard-coded implicit rule.

Requested by: truckman (2)
MFC after: 1 week


# 149979 11-Sep-2005 truckman

Relocate witness_levelall(), witness_leveldescendents(), and
witness_displaydescendants() so that they are protected by
"#ifdef DDB/#endif" to unbreak kernels not using "option DDB".

MFC after: 3 weeks


# 149738 02-Sep-2005 jhb

- Add some comments to some of the static lock orders. Don't explicitly
link proctree and allproc to Giant since that order is already implicitly
enforced.
- Use a goto to handle the case where we want to enforce a reversal before
calling isitmydescendant() in witness_checkorder() so that the logic is
easier to follow and so that it is easier to add more forced-reversal
cases in the future.

MFC after: 3 days


# 149441 25-Aug-2005 truckman

Track all lock relationships instead of pruning direct relationships
if an indirect relationship exists (keep both A->B->C and A->C).
This allows witness_checkorder() to use isitmychild() instead of
the much more expensive isitmydescendant() to check for valid lock
ordering.

Don't do an expensive tree walk to update the w_level values when
the tree is updated. Only update the w_level values when using the
debugger to display the tree.

Nuke the experimental "witness_watch > 1" mode that only compared
w_level for the two locks. This information is no longer maintained
at run time, and the use of isitmychild() in witness_checkorder
should bring performance close enough to the acceptable level that
this hack is not needed.

Report witness data structure allocation statistics under the
debug.witness sysctl.

Reviewed by: jhb
MFC after: 30 days


# 148896 09-Aug-2005 rwatson

Add an order between UDP inpcb locks and the IPv4 multicast address
list lock, as there has been a report that an alternative lock order
is getting introduced. This should help ferret it out.

Reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>


# 148682 03-Aug-2005 rwatson

Introduce in_multi_mtx, which will protect IPv4-layer multicast address
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.

For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.

Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().

Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.

Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.

Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.

Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.

Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days


# 146982 04-Jun-2005 marius

After some input from bde@ and rereading the datasheet use a MTX_SPIN
mutex instead of a MTX_DEF one in order to defer preemption while
reading the date and time registers. If we don't manage to read them
within the time slot where we are guaranteed that no updates occur we
might actually read them during an update in which case the output is
undefined.


# 145425 22-Apr-2005 jeff

- Define the real lock order with cdev and a few vm/vfs related locks. This
can be removed once cdev no longer calls free() with the cdev lock held.


# 145422 22-Apr-2005 jeff

- Check LO_DUPOK as well as LOP_DUPOK when determining whether we should
warn about duplicate acquires.

Sponsored by: Isilon Systems, Inc.


# 144966 12-Apr-2005 vkashyap

The latest release of the FreeBSD driver (twa) for
3ware's 9xxx series controllers. This corresponds to
the 9.2 release (for FreeBSD 5.2.1) on the 3ware website.

Highlights of this release are:

1. The driver has been re-architected to use a "Common Layer"
(all tw_cl* files), which is a consolidation of all OS-independent
parts of the driver. The FreeBSD OS specific portions of the
driver go into an "OS Layer" (all tw_osl* files).
This re-architecture is to achieve better maintainability, consistency
of behavior across OS's, and better portability to new OS's (drivers
for new OS's can be written by just adding an OS Layer that's specific
to the OS, by complying to a "Common Layer Programming Interface" API.

2. The driver takes advantage of multiple processors.

3. The driver has a new firmware image bundled, the new features of which
include Online Capacity Expansion and multi-lun support, among others.
More details about 3ware's 9.2 release can be found here:
http://www.3ware.com/download/Escalade9000Series/9.2/9.2_Release_Notes_Web.pdf

Since the Common Layer is used across OS's, the FreeBSD specific include
path for header files (/sys/dev/twa) is not part of the #include pre-processor
directive in any of the source files. For being able to integrate twa into
the kernel despite this, Makefile.<arch> has been changed to add the include
path to CFLAGS.

Reviewed by: scottl


# 144836 09-Apr-2005 pjd

CDEV lock should be before 'system map' lock.
Hardcode this order to help track down reported LOR.

LOR reported by: Thierry Herbelot <thierry@herbelot.com>
LOR info: http://sources.zabbadoz.net/freebsd/lor.html#080


# 144832 09-Apr-2005 pjd

Add a missing terminator.

Confirmed by: rwatson


# 143335 09-Mar-2005 rwatson

Document, via WITNESS, that the NFS server mutex falls ahead of the socket
buffer mutexes.


# 143204 07-Mar-2005 wpaul

When you call MiniportInitialize() for an 802.11 driver, it will
at some point result in a status event being triggered (it should
be a link down event: the Microsoft driver design guide says you
should generate one when the NIC is initialized). Some drivers
generate the event during MiniportInitialize(), such that by the
time MiniportInitialize() completes, the NIC is ready to go. But
some drivers, in particular the ones for Atheros wireless NICs,
don't generate the event until after a device interrupt occurs
at some point after MiniportInitialize() has completed.

The gotcha is that you have to wait until the link status event
occurs one way or the other before you try to fiddle with any
settings (ssid, channel, etc...). For the drivers that set the
event sycnhronously this isn't a problem, but for the others
we have to pause after calling ndis_init_nic() and wait for the event
to arrive before continuing. Failing to wait can cause big trouble:
on my SMP system, calling ndis_setstate_80211() after ndis_init_nic()
completes, but _before_ the link event arrives, will lock up or
reset the system.

What we do now is check to see if a link event arrived while
ndis_init_nic() was running, and if it didn't we msleep() until
it does.

Along the way, I discovered a few other problems:

- Defered procedure calls run at PASSIVE_LEVEL, not DISPATCH_LEVEL.
ntoskrnl_run_dpc() has been fixed accordingly. (I read the documentation
wrong.)

- Similarly, the NDIS interrupt handler, which is essentially a
DPC, also doesn't need to run at DISPATCH_LEVEL. ndis_intrtask()
has been fixed accordingly.

- MiniportQueryInformation() and MiniportSetInformation() run at
DISPATCH_LEVEL, and each request must complete before another
can be submitted. ndis_get_info() and ndis_set_info() have been
fixed accordingly.

- Turned the sleep lock that guards the NDIS thread job list into
a spin lock. We never do anything with this lock held except manage
the job list (no other locks are held), so it's safe to do this,
and it's possible that ndis_sched() and ndis_unsched() can be
called from DISPATCH_LEVEL, so using a sleep lock here is
semantically incorrect. Also updated subr_witness.c to add the
lock to the order list.


# 140637 22-Jan-2005 rwatson

When DDB is not defined, don't implement witness_thread_has_locks() and
witness_proc_has_locks(), as they are unused, which results in a compiler
error. This problem was introduced with the implementation of "show
alllocks".

Spotted by: Artem Kuchin <matrix at itlegion dot ru>


# 139378 28-Dec-2004 jhb

- Up the WITNESS_COUNT macro from 200 to 1024 to support the growing number
of lock types in the kernel. This results in an increase of witness
data usage from ~145k to ~280k on i386 for kernels with
'options WITNESS'.
- Remove the unused witness malloc bucket.

Submitted by: Michal Mertl mime at traveller dot cz (1)


# 139348 27-Dec-2004 rwatson

Attempt to slightly refine the print out from "show alllocks" -- list
the process and thread numbers/names on the same line rather than on
separate lines, and print the thread pointer not just the tid.


# 139333 26-Dec-2004 rwatson

Add "show alllocks" command to DDB, which dumps a list of processes
and threads currently holding sleep mutexes (and spin mutexes for
curthread). This can be quite useful in looking for a lock condition
summary for a system, as it avoids manually iterating through threads
and processes to find all the interesting locks.

NB: "alllocks" is up there with "lockedvnods" for a bad argument for
show.

MFC after: 2 weeks


# 137442 09-Nov-2004 jmg

clean up some tunables that should of been removed a while ago...


# 136374 11-Oct-2004 rwatson

Add entropy harvest mutex to hard-coded spin lock witness lock order,
remove previous entropy harvesting mutex names as they are no longer
present. Commit to this file was ommitted when randomdev_soft.c:1.5
was made.

Feet shot: Robert Huff <roberthuff at rcn dot com>


# 136304 09-Oct-2004 green

Don't "implicitly order all sleep locks before spin locks" in witness
when the spin lock in question isn't -- it's the critical_enter() that
KDB set. No more panic in DDB for console -> syscons -> tty -> knote
operations.


# 134971 09-Sep-2004 rwatson

Hard code witness lock order for BPF locks.


# 134873 06-Sep-2004 jmg

make witness it's own sysctl branch instead of using _ to do this. I have
left the old tunables in to give people a few days to transition their
loader.conf and sysctl.conf's over to the new names..

MFC after: 5 days


# 133139 04-Aug-2004 jhb

Remove a potential deadlock on i386 SMP by changing the lazypmap ipi and
spin-wait code to use the same spin mutex (smp_tlb_mtx) as the TLB ipi
and spin-wait code snippets so that you can't get into the situation of
one CPU doing a TLB shootdown to another CPU that is doing a lazy pmap
shootdown each of which are waiting on each other. With this change, only
one of the CPUs would do an IPI and spin-wait at a time.


# 132639 25-Jul-2004 rwatson

Add netatalk mutexes to hard-coded WITNESS lock order.


# 131930 10-Jul-2004 marcel

Update for the KDB framework:
o Make debugging code conditional upon KDB instead of DDB.
o s/WITNESS_DDB/WITNESS_KDB/g
o s/witness_ddb/witness_kdb/g
o Rename the debug.witness_ddb sysctl to debug.witness_kdb.
o Call kdb_backtrace() instead of backtrace().
o Call kdb_enter() instead Debugger().
o Assert kdb_active instead of db_active.


# 131884 09-Jul-2004 jhb

Check the lock lists to see if they are empty directly rather than
assigning a pointer to the list and then dereferencing the pointer as a
second step. When the first spin lock is acquired, curthread is not in
a critical section so it may be preempted and would end up using another
CPUs lock list instead of its own.

When this code was in witness_lock() this sequence was safe as curthread
was in a critical section already since witness_lock() is called after the
lock is acquired.

Tested by: Daniel Lang dl at leo.org


# 130396 12-Jun-2004 rwatson

Introduce socket and UNIX domain socket locks into hard-coded lock
order definition for witness. Send lock before receive lock, and
socket locks after accept but before select:

filedesc -> accept -> so_snd -> so_rcv -> sellck

All routing locks after send lock:

so_rcv -> radix node head

All protocol locks before socket locks:

unp -> so_snd
udp -> udpinp -> so_snd
tcp -> tcpinp -> so_snd


# 130031 03-Jun-2004 jhb

- Comment out NULL, NULL barrier for Unix domain sockets section as the
double NULL entries signal Witness to stop processing the array of
order entries meaning none of the spin locks are added resulting in
panics on boot.
- Add a missing NULL, NULL terminator to the Slip locks list to keep them
separate from the spin locks.


# 130022 02-Jun-2004 rwatson

Expand the hard-coded WITNESS lock order to include the following
relationships:

Sockets: filedesc->accept->sellck
Routing: radix node head->rtentry->ifaddr
UDP: udp->udpinp
TCP: tcp->tcpinp
SLIP: slip_mtx->slip sc_mtx

Drop in a place holder section for UNIX domain sockets. Various
sections to be expanded over the next few days.


# 127323 22-Mar-2004 alfred

Emit a traceback when witness_trace is set and witness_warn() is
called and triggers (typically caused by sleeping with a non-sleepable
lock).

Reviewed by: jhb


# 126324 27-Feb-2004 jhb

Add an implementation of a generic sleep queue abstraction that is used
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock. This subsystem will be used as
the backend for sleep/wakeup and condition variables initially. Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.

Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler. Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations. For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.


# 125393 03-Feb-2004 jhb

Remove a bogus assertion.

Noticed by: bde
Pointy hat to: jhb


# 125348 02-Feb-2004 jhb

- Assert that witness_cold is not true in enroll().
- Only check witness_watch once in enroll().

Reported by: ru (2)


# 125160 28-Jan-2004 jhb

Rework witness_lock() to make it slightly more useful and flexible.
- witness_lock() is split into two pieces: witness_checkorder() and
witness_lock(). Witness_checkorder() determines if acquiring a specified
lock at the time it is called would result in a lock order. It
optionally adds a new lock order relationship as well. witness_lock()
updates witness's data structures to assume that a lock has been acquired
by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
acquire a lock and continue to call witness_lock() after the acquire is
completed. This will let witness catch a deadlock before it happens
rather than trying to do so after the threads have deadlocked (i.e. never
actually report it).
- A new function witness_defineorder() has been added that adds a lock
order between two locks at runtime without having to acquire the locks.
If the lock order cannot be added it will return an error. This function
is available to programmers via the WITNESS_DEFINEORDER() macro which
accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
witness_checkorder() anywhere as a way of enforcing locking assertions
in code that might acquire a certain lock in some situations. The
macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
was unnested by using a goto to vastly improve the readability of this
function.


# 124972 25-Jan-2004 ru

Register the uart(4)'s spin lock with witness(4).


# 122917 20-Nov-2003 markm

Fix a major faux pas of mine. I was causing 2 very bad things to
happen in interrupt context; 1) sleep locks, and 2) malloc/free
calls.

1) is fixed by using spin locks instead.

2) is fixed by preallocating a FIFO (implemented with a STAILQ)
and using elements from this FIFO instead. This turns out
to be rather fast.

OK'ed by: re (scottl)
Thanks to: peter, jhb, rwatson, jake
Apologies to: *


# 122849 17-Nov-2003 peter

Initial landing of SMP support for FreeBSD/amd64.

- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
(in particular, bootstrap)
- This is still a WIP. It seems that there are some highly bogus bioses
on nVidia nForce3-150 boards. I can't stress how broken these boards
are. I have a workaround in mind, but right now the Asus SK8N is broken.
The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE. SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
atpic' in addition, because they somehow managed to forget to connect the
8254 timer to the apic, even though its in the same silicon! ARGH!
This directly violates the ACPI spec.


# 122771 15-Nov-2003 bde

Localized the cy driver's locking.


# 122514 11-Nov-2003 jhb

Add an implementation of turnstiles and change the sleep mutex code to use
turnstiles to implement blocking isntead of implementing a thread queue
directly. These turnstiles are somewhat similar to those used in Solaris 7
as described in Solaris Internals but are also different.

Turnstiles do not come out of a fixed-sized pool. Rather, each thread is
assigned a turnstile when it is created that it frees when it is destroyed.
When a thread blocks on a lock, it donates its turnstile to that lock to
serve as queue of blocked threads. The queue associated with a given lock
is found by a lookup in a simple hash table. The turnstile itself is
protected by a lock associated with its entry in the hash table. This
means that sched_lock is no longer needed to contest on a mutex. Instead,
sched_lock is only used when manipulating run queues or thread priorities.
Turnstiles also implement priority propagation inherently.

Currently turnstiles only support mutexes. Eventually, however, turnstiles
may grow two queue's to support a non-sleepable reader/writer lock
implementation. For more details, see the comments in sys/turnstile.h and
kern/subr_turnstile.c.

The two primary advantages from the turnstile code include: 1) the size
of struct mutex shrinks by four pointers as it no longer stores the
thread queue linkages directly, and 2) less contention on sched_lock in
SMP systems including the ability for multiple CPUs to contend on different
locks simultaneously (not that this last detail is necessarily that much of
a big win). Note that 1) means that this commit is a kernel ABI breaker,
so don't mix old modules with a new kernel and vice versa.

Tested on: i386 SMP, sparc64 SMP, alpha SMP


# 122001 03-Nov-2003 jhb

Update spin lock order list for new i386 interrupt and SMP code.


# 121307 21-Oct-2003 silby

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


# 119813 06-Sep-2003 sam

add fast swi taskqueue spinlock to the order_list so witness doesn't complain

Submitted by: Tor Egge <Tor.Egge@cvsup.no.freebsd.org>


# 118441 04-Aug-2003 jhb

Insert cosmetic spaces.

Reported by: kris


# 118271 31-Jul-2003 jhb

Add a new function to look for a spinlock's instance when it is held by
another thread. We use the td_oncpu member of the other field to locate
it's associated CPU and then search the that CPU's list of spin locks
contained in its per-CPU data. This is not always safe and may in fact
panic or just not work, but it is useful in at least one case.


# 117372 09-Jul-2003 peter

unifdef -DLAZY_SWITCH and start to tidy up the associated glue.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 115537 31-May-2003 phk

Remove return after panic.

Found by: FlexeLint


# 115425 31-May-2003 peter

Add __amd64__ to the ifdefs that introduce the "pcicfg" spinlock to
witness.

Approved by: re (safe amd64 support)


# 113339 10-Apr-2003 julian

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but it's purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 113275 09-Apr-2003 mike

o In struct prison, add an allprison linked list of prisons (protected
by allprison_mtx), a unique prison/jail identifier field, two path
fields (pr_path for reporting and pr_root vnode instance) to store
the chroot() point of each jail.
o Add jail_attach(2) to allow a process to bind to an existing jail.
o Add change_root() to perform the chroot operation on a specified
vnode.
o Generalize change_dir() to accept a vnode, and move namei() calls
to callers of change_dir().
o Add a new sysctl (security.jail.list) which is a group of
struct xprison instances that represent a snapshot of active jails.

Reviewed by: rwatson, tjr


# 112993 02-Apr-2003 peter

Commit a partial lazy thread switch mechanism for i386. it isn't as lazy
as it could be and can do with some more cleanup. Currently its under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process. ie: we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the tlb. However, this isn't as exciting as it could
be, the interrupt overhead is still high and too much blocks on Giant
still. There are some debug sysctls, for stats and for an on/off switch.

The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.

Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.

This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.

The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.

One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process. This happens
uncomfortably often. This simplifies a bit of the asm code in cpu_switch
(no longer have to call choosethread() in the middle). This has been
implemented on i386 and (thanks to jake) sparc64. The others will come
soon. This is actually seperate to the lazy switch stuff.

Glanced at by: jake, jhb


# 112562 24-Mar-2003 jhb

- Remove witness_dead and just use witness_watch instead. If witness_watch
is set to 0, it now has the same affect as setting witness_dead used to
have.
- Added a sysctl handler that allows root to change witness_watch from a
non-zero value to zero to disable witness at runtime. Note that you
can't turn witness back on once it is off. You can only turn it off as
a one-way switch.
- Added a comment describing the possible values of witness_watch.


# 112123 11-Mar-2003 jhb

Trim an extra blank line that snuck into the last commit.


# 112118 11-Mar-2003 jhb

- Change witness_displaydescendants() to accept the indentation level as
a parameter instead of using the level of a given witness. When
recursing, pass an indent level of indent + 1.
- Make use of the information witness_levelall() provides in
witness_display_list() to use an O(n) algorithm instead of an O(n^2)
algo to decide which witnesses to display hierarchies from. Basically,
we only display a hierarchy for witnesses with a level of 0.
- Add a new per-witness flag that is reset at the start of
witness_display() for all witness's and is set the first time a witness
is displayed in witness_displaydescendants(). If a witness is
encountered more than once in the lock order tree (which happens often),
witness_displaydescendants() marks the later occurrences with the string
"(already displayed)" and doesn't display the subtree under that
witness. This avoids duplicating large amounts of the lock order tree
in the 'show witness' output in DDB.

All these changes serve to make 'show witness' a lot more readable and
useful than it was previously.


# 112117 11-Mar-2003 jhb

- Split the itismychild() function into two functions: insertchild()
adds a witness to the child list of a parent witness. rebalancetree()
runs through the entire tree removing direct descendants of witnesses
who already have said child witness as an indirect descendant through
another direct descendant. itismychild() now calls insertchild()
followed by rebalancetree() and no longer needs the evil hack of
having static recursed variable.
- Add a function reparentchildren() that adds all the direct descendants
of one witness as direct descendants of another witness.
- Change the return value of itismychild() and similar functions so that
they return 0 in the case of failure due to lack of resources instead
of 1. This makes the return value more intuitive.
- Check the return value of itismychild() when defining the static lock
order in witness_initialize().
- Don't try to setup a lock instance in witness_lock() if itismychild()
fails. Witness is hosed anyways so no need to do any more witness
related activity at that point. It also makes the code flow easier to
understand.
- Add a new depart() function as the opposite of enroll(). When the
reference count of a witness drops to 0 in witness_destroy(), this
function is called on that witness. First, it runs through the
lock order tree using reparentchildren() to reparent direct descendants
of the departing witness to each of the witness' parents in the tree.
Next, it releases it's own child list and other associated resources.
Finally it calls rebalanacetree() to rebalance the lock order tree.
- Sort function prototypes into something closer to alphabetical order.

As a result of these changes, there should no longer be 'dead' witnesses
in the order tree, and repeatedly loading and unloading a module should no
longer exhaust witness of its internal resources.

Inspired by: gallatin


# 112116 11-Mar-2003 jhb

Trim useless "../" leading strings from filenames passed into witness.


# 112115 11-Mar-2003 jhb

Adjust style of #ifdef's and #endif's to be more consistent and in line
with recent additions to style(9).


# 112112 11-Mar-2003 jhb

Do the lock order check skip for the LOP_TRYLOCK case after the check for
recursing on a lock instead of before. This fixes a bug where WITNESS
could get a little confused if you did an sx_tryslock() on a sx lock that
you already had an slock on. WITNESS would still function correctly but
it could result in weirdness in the output of 'show locks'. This also
makes it possible for mtx_trylock() to recurse on a lock.


# 112061 10-Mar-2003 jhb

Now that we have WITNESS_WARN(), we only call witness_list() from the
ddb 'show locks' command. Thus, move witness_list() to the #ifdef DDB
section and remove extra checks for calling this function outside of
DDB. Also, witness_list() now returns void instead of returning an int.

Reported by: Steve Ames <steve@energistic.com>
Prodded by: davidxu


# 111951 06-Mar-2003 jhb

Oops, fix the double faults people were seeing with the recent changes to
witness. Sleepable locks such as sx locks always come before all mutexes
including Giant. However, the static lock order list placed Giant before
the proctree and allproc sx locks. This resulted in witness creating a
cycle in its lock order "tree" (real trees don't have cycles) leading to
infinite recursion and eventually a double fault. To fix, put Giant after
sx locks in the lock order list.


# 111887 04-Mar-2003 jhb

Bah, fix a bogon in the last commit: get the sense of a compare test right
so that we allow a sleepable lock to be acquired with Giant held rather
than allowing a sleepable lock to be acquired with anything but Giant held.


# 111881 04-Mar-2003 jhb

A small overhaul of witness:
- Add a comment about special lock order rules and Giant near the top of
subr_witness.c. Specifically, this documents and explains the real lock
order relationship between Giant and sleepable locks (i.e. lockmgr locks
and sx locks). Basically, Giant can be safely acquired either before or
after sleepable locks and the case of Giant before a sleepable lock is
exempted as a special case.
- Add a new static function 'witness_list_lock()' that displays a single
line of information about a struct lock_instance. This is used to
make the output of witness messages more consistent and reduce some code
duplication.
- Fixup a few comments in witness_lock().
- Properly handle the Giant-before-sleepable-lock lock order exception in
a more general fashion and remove the no longer needed LI_SLEPT flag.
- Break up the last condition before assuming a reversal a bit to try
and make the logic less confusing in witness_lock().
- Axe WITNESS_SLEEP() now that LI_SLEPT is no longer needed and replace it
with a more general WITNESS_WARN() macro/function combination.
WITNESS_WARN() allows you to output a customized message out to the
console along with a list of held locks. It will optionally drop into
the debugger as well. You can exempt a single lock from the check by
passing it in as the second argument. You can also use flags to specify
if Giant should be exempt from the check, if all sleepable locks should
be exempt from the check, and if witness should panic if any non-exempt
locks are found.
- Make the witness_list() function static. Other areas of the kernel
should use the new WITNESS_WARN() instead.


# 111068 18-Feb-2003 peter

Initiate de-orbit burn for USE_PCI_BIOS_FOR_READ_WRITE. This has been
#if'ed out for a while. Complete the deed and tidy up some other bits.

We need to be able to call this stuff from outer edges of interrupt
handlers for devices that have the ISR bits in pci config space. Making
the bios code mpsafe was just too hairy. We had also stubbed it out some
time ago due to there simply being too much brokenness in too many systems.
This adds a leaf lock so that it is safe to use pci_read_config() and
pci_write_config() from interrupt handlers. We still will use pcibios
to do interrupt routing if there is no acpi.. [yes, I tested this]

Briefly glanced at by: imp


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much the
KSE code.

Submitted by: davidxu


# 110779 12-Feb-2003 peter

Add a 'debug.witness_trace' sysctl (and tunable) when DDB is present.
This causes LOR and could-sleep messages to come with a stack trace.


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109877 26-Jan-2003 davidxu

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 109015 09-Jan-2003 jake

Oops, add zstty to the witness order list.

Noticed by: benno


# 108187 22-Dec-2002 jake

- Add a spin lock to single thread cache invalidation and tlb flush ipis,
which allows ipis to be sent outside of Giant.
- Remove the ap boot mutex, which is unused.


# 108184 22-Dec-2002 kris

Enforce correct ordering of the filedesc structure and pipe mutex, because
WITNESS can get the order wrong if it guesses based on first use.

Reviewed by: jhb, alfred


# 106781 11-Nov-2002 jhb

Correct an assertion in the code to traverse the list of locks to find an
earlier acquired lock with the same witness as the lock currently being
acquired. If we had released several earlier acquired locks after
acquiring enough locks to require another lock_list_entry bucket in the
lock list, then subsequent lock_list_entry buckets could contain only one
lock instance in which case i would be zero.

Reported by: Joel M. Baldwin <qumqats@outel.org>


# 106360 02-Nov-2002 alc

Catch up with the removal of the vm page buckets spin mutex.


# 105508 20-Oct-2002 phk

#unifdef the code for checking blessed lock collisions until we need it.

Spotted by: DARPA & NAI Labs.


# 104951 11-Oct-2002 peter

Register the machine check private state spinlock on ia64.


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 103786 22-Sep-2002 jeff

- Tell witness about ALQ's spin lock.


# 103091 08-Sep-2002 jake

Make this driver work a whole lot better.
- Get the initial mode from the prom settings and don't clobber the mode
on open.
- Copy output into an internal ring buffer instead of accessing the tty
outq directly in the interrupt handler. This fixes a problem where
garbage would show up in the output stream.
- Reset the console port completely and reprogram all the parameters
before enabling it. This fixes seemingly random hangs on startup
when using a fast interrupt handler.
- Add minimal locking in place of spls.
- Remove dead code and minor cleanups.


# 102448 26-Aug-2002 iedowse

Add WITNESS_FILE() and WITNESS_LINE(), which allow users of witness
to print out the file and line from the lock object. These will be
used shortly by CTR() calls in the mutex code.

Reviewed by: jhb, jake


# 100011 15-Jul-2002 mp

Silence compiler warnings when DDB is not defined.

PR: 36002
Submitted by: Yoshikazu GOTO <goto@snowy.to>


# 99862 12-Jul-2002 peter

Revive backed out pmap related changes from Feb 2002. The highlights are:
- It actually works this time, honest!
- Fine grained TLB shootdowns for SMP on i386. IPI's are very expensive,
so try and optimize things where possible.
- Introduce ranged shootdowns that can be done as a single IPI.
- PG_G support for i386
- Specific-cpu targeted shootdowns. For example, there is no sense in
globally purging the TLB cache for where we are stealing a page from
the local unshared process on the local cpu. Use pm_active to track
this.
- Add some instrumentation for the tlb shootdown code.
- Rip out SMP code from <machine/cpufunc.h>
- Try and fix some very bogus PG_G and PG_PS interactions that were bad
enough to cause vm86 bios calls to break. vm86 depended on our existing
bugs and this was the cause of the VESA panics last time.
- Fix the silly one-line error that caused the 'panic: bad pte' last time.
- Fix a couple of other silly one-line errors that should have caused more
pain than they did.

Some more work is needed:
- pmap_{zero,copy}_page[_idle]. These can be done without IPI's if we
have a hook in cpu_switch.
- The IPI handlers need some cleanup. I have a bogus %ds load that can
be avoided.
- APTD handling is rather bogus and appears to be a large source of
global TLB IPI shootdowns for no really good reason.

I see speedups of between 1.5% and ~4% on buildworlds in a while 1 loop.
I expect to see a bigger difference when there is significant pageout
activity or the system otherwise has memory shortages.

I have backed out a few optimizations that I had been using over the last
few days in order to be a little more conservative. I'll revisit these
again over the next few days as the dust settles.

New option: DISABLE_PG_G - In case I missed something.


# 99416 04-Jul-2002 alc

o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free
queue lock (revision 1.33 of vm/vm_page.c removed them).
o Make the free queue lock a spin lock because it's sometimes acquired
inside of a critical section.


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 97963 06-Jun-2002 jhb

Change the all locks list from a STAILQ to a TAILQ. This bloats struct
lock_object by another pointer (though all of lock_object should be
conditional on LOCK_DEBUG anyways) in exchange for an O(1) TAILQ_REMOVE()
in witness_destroy() (called for every mtx_destroy() and sx_destroy())
instead of an O(n) STAILQ_REMOVE. Since WITNESS is so dog slow as it is,
the speed-up is worth the space cost.

Suggested by: iedowse


# 97948 06-Jun-2002 jhb

Handle "dead" witnesses better in the situation of several short term locks
being created and destroyed without a single long-term one around to ensure
the witness associated with that group of locks stays alive. The pipe
mutexes are an example of this group. For a dead witness we no longer
clear the witness name. Instead, when looking up the witness for a lock,
if a dead witness' (a witness with a refcount of 0) w_name pointer is
identical to the witness name of the lock then we revive that witness
instead of using a new witness for the lock. This results in far fewer
dead witness objects and also better preserves locking orders over the long
term resulting in more correct lock order checking. Note that we can't
ever derefence w_name of a dead witness since we don't know if the string
it is pointing to has been free()'d or kldunload()'d out from under us.


# 97014 20-May-2002 jhb

In witness_unlock(), when updating a lock list entry bucket, decrement the
count of lock list entries after we fixup the bucket of lock list entries.
In theory we can remove the intr_disable/intr_restore() calls now.


# 97006 20-May-2002 jhb

- Allow witness_sleep() to be called when witness hasn't been initialized
yet. We just return without performing any checks.
- Don't explicitly enter and exit critical sections when walking lock
lists. We don't need a critical section to walk the list of sleep
locks for a thread. We check to see if a spin lock list is empty
before we walk it. If the list is empty we don't need to walk it. If
it isn't then we already hold at least one spin lock and are already in
a critical section and thus don't need our own explicit critical
section.


# 96122 06-May-2002 alfred

Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.


# 95823 30-Apr-2002 alc

o Convert the vm_page buckets mutex to a spin lock. (This resolves
an issue on the Alpha platform found by jeff@.)
o Simplify vm_page_lookup().

Reviewed by: jhb


# 95543 27-Apr-2002 jhb

Whitespace bogon.


# 95541 27-Apr-2002 marcel

Insert a semi-colon between label 'skip:' and the closing brace
of the FOREACH loop to silence GCC 3.


# 95473 25-Apr-2002 des

Add the mutex profiling lock to the witness list. This hopefully unbreaks
the MUTEX_PROFILING + WITNESS + !WITNESS_SKIPSPIN case.

Submitted by: Hiten Pandya <hiten@uk.FreeBSD.org>


# 94857 16-Apr-2002 jhb

- Merge the pgrpsess_lock and proctree_lock sx locks into one proctree_lock
sx lock. Trying to get the lock order between these locks was getting
too complicated as the locking in wait1() was being fixed.
- leavepgrp() now requires an exclusive lock of proctree_lock to be held
when it is called.
- fixjobc() no longer gets a shared lock of proctree_lock now that it
requires an xlock be held by the caller.
- Locking notes in sys/proc.h are adjusted to note that everything that
used to be protected by the pgrpsess_lock is now protected by the
proctree_lock.


# 94325 09-Apr-2002 jhb

Display the recursion count in the lock_instance in the show locks
output.

Indirectly requested by: peter


# 94324 09-Apr-2002 jhb

Cosmetic fixup in output of lock types in show locks output.


# 93811 04-Apr-2002 jhb

Add a new char * pointer lo_type to struct lock_object that is used to
point to a more generic name for a lock that is more suitable for use by
witness when grouping locks. For example, although network driver locks
use the interface name for the name of each lock, they should all use the
same witness and be treated the same as witness. Another example is that
all UMA zone locks should be treated the same. The witness code has also
been updated to print out the lock type in addition to the lock name in a
few places where it is relevant.


# 93690 02-Apr-2002 jhb

Enforce an implicit lock order of sleepable locks before non-sleepable
locks.


# 93676 02-Apr-2002 jhb

Explicitly document how we implicitly enforce the lock order of sleep
locks before spin locks.


# 93273 27-Mar-2002 jeff

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


# 92858 21-Mar-2002 imp

Remove last two abuses of cpu_critical_{enter,exit} in the MI code.

Reviewed by: jake, jhb, rwatson


# 91905 08-Mar-2002 jhb

- Use a MI critical section in witness_sleep() and witness_list() as they
simply need to prevent switching from another CPU and do not need
interrupts disabled.
- Add a comment to witness_list() about why displaying spin locks for
threads on other CPU's really is just a bad idea and probably shouldn't
be done.


# 91367 27-Feb-2002 peter

Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.


# 91260 25-Feb-2002 peter

Work-in-progress commit syncing up pmap cleanups that I have been working
on for a while:
- fine grained TLB shootdown for SMP on i386
- ranged TLB shootdowns.. eg: specify a range of pages to shoot down with
a single IPI, since the IPI is very expensive. Adjust some callers
that used to trigger this inside tight loops to do a ranged shootdown
at the end instead.
- PG_G support for SMP on i386 (options ENABLE_PG_G)
- defer PG_G activation till after we decide what we are going to do with
PSE and the 4MB pages at the start of the kernel. This should solve
some rumored strangeness about stale PG_G entries getting stuck
underneath the 4MB pages.
- add some instrumentation for the fine TLB shootdown
- convert some asm instruction wrappers from functions to inlines. gcc
seems to do a fair bit better with this.
- [temporarily!] pessimize the tlb shootdown IPI handlers. I will fix
this again shortly.

This has been working fairly well for me for a while, but I have tweaked
it again prior to commit since my last major testing round. The only
outstanding problem that I know of is PG_G related, which is why there
is an option for it (not on by default for SMP). I have seen a world
speedups by a few percent (as much as 4 or 5% in one case) but I have
*not* accurately measured this - I am a bit sceptical of these numbers.


# 91140 23-Feb-2002 tanimura

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 90278 06-Feb-2002 jhb

Fixes for alpha pmap on SMP machines:

- Create a private list of active pmaps rather than abusing the list of all
processes when we need to look up pmaps. The process list needs a sx lock
and we can't be getting sx locks in the middle of cpu_switch()
(pmap_activate() can call pmap_get_asn() from cpu_switch()). Instead, we
protect the list with a spinlock. This also means the list is shorter
since a pmap can be used by more than one process and we could (at least
in thoery) dink with pmap's more than once, but now we only touch each
pmap once when we have to update all of them.
- Wrap pmap_activate()'s code to get a new ASN in an explicit critical section
so that when it is called while doing an exec() we can't get preempted.
- Replace splhigh() in pmap_growkernel() with a critical section to prevent
preemption while we are adjusting the kernel page tables.
- Fixes abuse of PCPU_GET(), which doesn't return an L-value.
- Also adds some slight cleanups to the ASN handling by adding some macros
instead of magic numbers in relation to the ASN and ASN generations.

Reviewed by: dfr


# 88900 05-Jan-2002 jhb

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex releease and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# 88899 05-Jan-2002 jhb

Remove brain damaged code in witness_lock(). We could have easily
just used PCPU_GET(spinlocks) w/o needing the w_mtx held. It is more
correct to just check td_critnest now though.


# 88322 20-Dec-2001 jhb

Introduce a standard name for the lock protecting an interrupt controller
and it's associated state variables: icu_lock with the name "icu". This
renames the imen_mtx for x86 SMP, but also uses the lock to protect
access to the 8259 PIC on x86 UP. This also adds an appropriate lock to
the various Alpha chipsets which fixes problems with Alpha SMP machines
dropping interrupts with an SMP kernel.


# 88088 17-Dec-2001 jhb

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha


# 87593 10-Dec-2001 obrien

Repeat after me -- "Use of ANSI string concatenation can be bad."
In this case, C99's __func__ is properly defined as:

static const char __func__[] = "function-name";

and GCC 3.1 will not allow it to be used in bogus string concatenation.


# 86422 15-Nov-2001 jhb

Add a couple of returns to making recovering from a failed witness_assert()
more sane in the RESTARTABLE_PANICS case.


# 84680 08-Oct-2001 jhb

Replace 'curproc' with 'td->td_proc'.


# 84331 01-Oct-2001 jhb

Move the ap boot spin lock earlier in the lock order before the sio(4)
lock since we occasionally call printf() while holding the ap boot lock
which can call down into the sio(4) driver if using a serial console.


# 83798 21-Sep-2001 jhb

Remove unneeded proc variables and fix comments.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82284 24-Aug-2001 jhb

Style nits:
- Don't use punctuation or newlines in panic messages.
- Remove excess blank lines.

Requested and partially submitted by: bde


# 82244 23-Aug-2001 jhb

Add witness_upgrade() and witness_downgrade() for handling upgrades and
downgrades of shared/exclusive locks.


# 82243 23-Aug-2001 jhb

Convert some KASSERT()'s into if (foo) panic() because they are testing
how locks are managed by the rest of the kernel, not verifying the internal
integrity of witness itself.


# 81489 10-Aug-2001 jhb

Make witness compile w/o DDB.

Reported by: wpaul


# 80747 31-Jul-2001 jhb

- Fix panicstr checks to explicitly check against NULL.
- Add a few more panicstr checks so that we don't panic recursively.

Requested by: sheldonh (2)


# 80055 20-Jul-2001 jhb

Add a missing ~ so that the LO_INITIALIZED flag actually gets turned off
in witness_destroy().


# 78942 28-Jun-2001 jhb

Forced commit. Previous commit was:

Submitted by: Alexander N. Kabaev <ak03@gte.com>


# 78941 28-Jun-2001 jhb

Don't check witness assertions if the lock doesn't use witness or witness
is dead.


# 78871 27-Jun-2001 jhb

- Add a new witness_assert() to perform arbitrary locking assertions.
- Clean up the KTR tracepoints to be slighlty more consistent and useful
- Fix a bug in WITNESS where we would recurse indefinitely and blow the
stack when acquiring Giant after sleeping with a sleepable lock held.

Reported by: tanimura (3)


# 78785 25-Jun-2001 jhb

- Move the 'clk' spinlock below other spin locks since KTR trace events
may need the clock lock for nanotime().
- Add KTR trace events for lock list manipulations and other witness
operations.
- Use a temporary variable instead of setting the lock list head directly
and then setting up the links to add a new lock list entry to the lock
list. This small race could result in witness "forgetting" about all
the locks held by this process temporarily during an interrupt.
- Close a more fatal race condition when removing a lock from a list.
Removing a lock from the list entails both decrementing the count of
items in this bucket as well as shuffling items in the current bucket up
a notch to replace the gap left by the removed item. Wrap these
operations in a critical section.


# 77900 08-Jun-2001 peter

"Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.


# 77853 07-Jun-2001 peter

Back out part of my previous commit. This was a last minute change
and I botched testing. This is a perfect example of how NOT to do
this sort of thing. :-(


# 77843 06-Jun-2001 peter

Make the TUNABLE_*() macros look and behave more consistantly like the
SYSCTL_*() macros. TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.


# 76772 17-May-2001 jhb

- Don't panic on a try lock operation for a sleep lock if we hold a spin
lock. Since we won't actually block on a try lock operation, it's not
a problem. Add a comment explaining why it is safe to skip lock order
checking with try locks.
- Remove the ithread list lock spin lock from the order list.


# 76481 11-May-2001 jhb

Check witness_dead in more functions to avoid panic'ing when assertions
fail due to witness exhausting its internal resources and shutting down.

Reported by: Szilveszter Adam <sziszi@petra.hos.u-szeged.hu>
Tested by: David Wolfskill <david@catwhisker.org>


# 76272 04-May-2001 jhb

- Move state about lock objects out of struct lock_object and into a new
struct lock_instance that is stored in the per-process and per-CPU lock
lists. Previously, the lock lists just kept a pointer to each lock held.
That pointer is now replaced by a lock instance which contains a pointer
to the lock object, the file and line of the last acquisition of a lock,
and various flags about a lock including its recursion count.
- If we sleep while holding a sleepable lock, then mark that lock instance
as having slept and ignore any lock order violations that occur while
acquiring Giant when we wake up with slept locks. This is ok because of
Giant's special nature.
- Allow witness to differentiate between shared and exclusive locks and
unlocks of a lock. Witness will now detect the case when a lock is
acquired first in one mode and then in another. Mutexes are always
locked and unlocked exclusively. Witness will also now detect the case
where a process attempts to unlock a shared lock while holding an
exclusive lock and vice versa.
- Fix a bug in the lock list implementation where we used the wrong
constant to detect the case where a lock list entry was full.


# 76142 29-Apr-2001 alfred

When panic()'ing because of recursion on a non-recursive mutex, print
out the location it was initially locked.

Ok'd by: jake


# 75755 20-Apr-2001 jhb

Spelling nit: acquring -> acquiring.

Reported by: T. William Wells <bill@twwells.com>


# 75711 19-Apr-2001 jhb

- Whoops, forgot to enable the clock lock in the spin order list on the
alpha.
- Change the Debugger() functions to pass in the real function name.


# 75569 17-Apr-2001 jhb

Check to see if enroll() returns NULL in the witness initialization. This
can happen if witness runs out of resources during initialization or if
witness_skipspin is enabled.

Sleuthing by: Peter Jeremy <peter.jeremy@alcatel.com.au>


# 75464 13-Apr-2001 jhb

- Add a comment at the start of the spin locks list.
- The alpha SMP code uses an "ap boot" spinlock as well.


# 75364 09-Apr-2001 bp

Avoid endless recursion on panic.

Reviewed by: jhb


# 75362 09-Apr-2001 jhb

Maintain a reference count on the witness struct. When the reference
count drops to 0 in witness_destroy, set the w_name and w_file pointers
to point to the string "(dead)" and the w_line field to 0. This way,
if a mutex of a given name is used only in a module, then as long as
all mutexes in the module are destroyed when the module is unloaded,
witness will not maintain stale references to the mutex's name in the
module's data section causing a panic later on when the w_name or w_file
field's are examined.


# 75273 06-Apr-2001 jhb

- Split out the functionality of displaying the contents of a single lock
list into a public witness_list_locks() function. Call this function
twice in witness_list() instead of using an evil goto.
- Adjust the 'show locks' command to take an optional parameter which
specifies the pid of a process to list the locks of. By default the
locks held by the current process are displayed.


# 74944 28-Mar-2001 jhb

Close a race condition where if we were obtaining a sleep lock and no spin
locks were held, we could be preempted and switch CPU's in between the time
that we set a variable to the list of spin locks on our CPU and the time
that we checked that variable to ensure no spinlocks were held while
grabbing a sleep lock. Losing the race resulted in checking some other
CPU's spin lock list and bogusly panicing.


# 74930 28-Mar-2001 jhb

- s/mutexes/locks/g in appropriate comments.
- Rename the 'show mutexes' ddb command to 'show locks' since it shows
a list of all the lock objects held by the current process.


# 74912 28-Mar-2001 jhb

Rework the witness code to work with sx locks as well as mutexes.
- Introduce lock classes and lock objects. Each lock class specifies a
name and set of flags (or properties) shared by all locks of a given
type. Currently there are three lock classes: spin mutexes, sleep
mutexes, and sx locks. A lock object specifies properties of an
additional lock along with a lock name and all of the extra stuff needed
to make witness work with a given lock. This abstract lock stuff is
defined in sys/lock.h. The lockmgr constants, types, and prototypes have
been moved to sys/lockmgr.h. For temporary backwards compatability,
sys/lock.h includes sys/lockmgr.h.
- Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin
locks held. By making this per-cpu, we do not have to jump through
magic hoops to deal with sched_lock changing ownership during context
switches.
- Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with
proc->p_sleeplocks, which is a list of held sleep locks including sleep
mutexes and sx locks.
- Add helper macros for logging lock events via the KTR_LOCK KTR logging
level so that the log messages are consistent.
- Add some new flags that can be passed to mtx_init():
- MTX_NOWITNESS - specifies that this lock should be ignored by witness.
This is used for the mutex that blocks a sx lock for example.
- MTX_QUIET - this is not new, but you can pass this to mtx_init() now
and no events will be logged for this lock, so that one doesn't have
to change all the individual mtx_lock/unlock() operations.
- All lock objects maintain an initialized flag. Use this flag to export
a mtx_initialized() macro that can be safely called from drivers. Also,
we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness
performs the corresponding checks using the initialized flag.
- The lock order reversal messages have been improved to output slightly
more accurate file and line numbers.


# 74900 28-Mar-2001 jhb

- Switch from using save/disable/restore_intr to using critical_enter/exit
and change the u_int mtx_saveintr member of struct mtx to a critical_t
mtx_savecrit.
- On the alpha we no longer need a custom _get_spin_lock() macro to avoid
an extra PAL call, so remove it.
- Partially fix using mutexes with WITNESS in modules. Change all the
_mtx_{un,}lock_{spin,}_flags() macros to accept explicit file and line
parameters and rename them to use a prefix of two underscores. Inside
of kern_mutex.c, generate wrapper functions for
_mtx_{un,}lock_{spin,}_flags() (only using a prefix of one underscore)
that are called from modules. The macros mtx_{un,}lock_{spin,}_flags()
are mapped to the __mtx_* macros inside of the kernel to inline the
usual case of mutex operations and map to the internal _mtx_* functions
in the module case so that modules will use WITNESS and KTR logging if
the kernel is compiled with support for it.


# 74016 09-Mar-2001 jhb

Fix mtx_legal2block. The only time that it is bad to block on a mutex is
if we hold a spin mutex, since we can trivially get into deadlocks if we
start switching out of processes that hold spinlocks. Checking to see if
interrupts were disabled was a sort of cheap way of doing this since most
of the time interrupts were only disabled when holding a spin lock. At
least on the i386. To fix this properly, use a per-process counter
p_spinlocks that counts the number of spin locks currently held, and
instead of checking to see if interrupts are disabled in the witness code,
check to see if we hold any spin locks. Since child processes always
start up with the sched lock magically held in fork_exit(), we initialize
p_spinlocks to 1 for child processes. Note that proc0 doesn't go through
fork_exit(), so it starts with no spin locks held.

Consulting from: cp


# 73912 07-Mar-2001 jhb

- Add an extra check in priority_propagation() for UP systems to ensure we
don't end up back at ourselves which would indicate deadlock.
- Add the proc lock to the witness dup_list as we may hold more than one
process lock at a time.
- Don't assert a mutex is owned in _mtx_unlock_sleep() as that is too late.
We do the checks in the macros instead.


# 73238 28-Feb-2001 julian

Shuffle netgraph mutexes a bit and hold a reference on a node
from the function that is calling the destructor.


# 73205 28-Feb-2001 jake

Sigh. Try to get priorities sorted out. Don't bother trying to
update native priority, it is diffcult to get right and likely
to end up horribly wrong. Use an honestly wrong fixed value
that seems to work; PUSER for user threads, and the interrupt
priority for ithreads. Set it once when the process is created
and forget about it.

Suggested by: bde
Pointy hat: me


# 73114 26-Feb-2001 jake

Initialize native priority to PRI_MAX. It was usually 0 which made a
process's priority go through the roof when it released a (contested)
mutex. Only set the native priority in mtx_lock if hasn't already
been set.

Reviewed by: jhb


# 73033 25-Feb-2001 jake

Remove brackets around variables in a function that used to be
a macro.


# 73003 25-Feb-2001 julian

Move netgraph spimlock order entries out of
the #ifdef SMP section. They need to be there for UP too.


# 72996 24-Feb-2001 jhb

Grrr, s/INVARIANTS_SUPPORT/INVARIANT_SUPPORT/.


# 72994 24-Feb-2001 jhb

- Axe RETIP() as it was very i386 specific and unwieldy. Instead, use the
passed in filename and line number in the KTR tracepoint message.
- Even though it is #if 0'd code, change the code to detect that a process
is an interrupt thread to check p->p_ithd against NULL rather than
checking non-existant process flags from BSD/OS.
- Use '%p' to print pointers in KTR log messages instead of assuming
sizeof(int) == sizeof(void *).
- Don't set p_mtxname to NULL when releasing a mutex. It doesn't hurt
to leave it set (we don't clear w_mesg for example) and at least at
one time in the past, there used to be race conditions in the kernel
that would result in setting this to NULL causing the kernel to
dereference NULL.
- Make the _mtx_assert() function be compiled in if INVARIANTS_SUPPORT is
defined rather than if INVARIANTS is defined so that a KLD compiled
with INVARIANTS that uses mtx_assert() can be used with a kernel that
just has INVARIANT_SUPPORT compiled in.


# 72979 24-Feb-2001 julian

Add knowledge of the netgraph spinlocks into the Witness code.
Well, at least I think that's how it's done.


# 72836 22-Feb-2001 jhb

- Use the NOCPU constant.
- Move the ithread spin locks before sched lock and clk in preparation for
future commits to the ithread code.


# 72393 12-Feb-2001 bmilekic

Change all instances of `CURPROC' and `CURTHD' to `curproc,' in order
to stay consistent.

Requested by: bde


# 72376 11-Feb-2001 jake

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# 72344 11-Feb-2001 bmilekic

- Place back STR string declarations for lock/unlock strings used for KTR_LOCK
tracing in order to avoid duplication.
- Insert some tracepoints back into the mutex acq/rel code, thus ensuring
that we can trace all lock acq/rel's again.
- All CURPROC != NULL checks are MPASS()es (under MUTEX_DEBUG) because they
signify a serious mutex corruption.
- Change up some KASSERT()s to MPASS()es, and vice-versa, depending on the
type of problem we're debugging (INVARIANTS is used here to check that
the API is being used properly whereas MUTEX_DEBUG is used to ensure that
something general isn't happening that will have bad impact on mutex
locks).

Reminded by: jhb, jake, asmodai


# 72256 09-Feb-2001 jhb

Unify the two sleep lock order lists to enforce the process lock ->
uidinfo lock locking order.


# 72224 09-Feb-2001 jhb

- Change the 'witness_list' ddb command to 'show mutexes'. Note that this
will only display sleep mutexes held by the current process.
- Clean up some nits in the witness_display() function and add a ddb
command 'show witness' that dumps the hierarchy and order lists to the
console.
- Use queue(3) macros where appropriate.
- Resort the spin lock order list so that "com" is before "sched_lock".
Also, add appropriate #ifdef's around SMP and i386-specific mutexes.
- Add two new mutexes used to protect the ithread lists and tables to the
order list.

Requested by: bde (1)


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 71709 27-Jan-2001 jhb

Add a new ddb command 'witness_list' that lists the mutexes held by
curproc.

Requested by: peter


# 71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 71560 24-Jan-2001 jhb

- Don't use a union and fun tricks to shave one extra pointer off of struct
mtx right now as it makes debugging harder. When we are in optimizing
mode, we can revisit this.
- Fix the KTR trace messages to use %p rather than 0x%p to avoid duplicate
0x's in KTR output.
- During witness_fixup, release Giant so that witness doesn't get confused.
Also, grab all_mtx while walking the list of mutexes.
- Remove w_sleep and w_recurse. Instead, perform checks on mutexes using
the mutex's mtx_flags field.
- Allow debug.witness_ddb and debug.witness_skipspin to be set from the
loader.
- Add Giant to the front of existing order_list entries to help ensure
Giant is always first.
- Add an order entry for the various proc locks. Note that this only
helps keep proc in order mostly as the allproc and proctree mutexes are
only obtained during a lockmgr operation on the specified mutex.


# 71360 22-Jan-2001 jasone

Print correct file name and line number in mtx_assert().

Noticed by: jake


# 71352 21-Jan-2001 jasone

Move most of sys/mutex.h into kern/kern_mutex.c, thereby making the mutex
inline functions non-inlined. Hide parts of the mutex implementation that
should not be exposed.

Make sure that WITNESS code is not executed during boot until the mutexes
are fully initialized by SI_SUB_MUTEX (the original motivation for this
commit).

Submitted by: peter


# 71328 21-Jan-2001 jasone

Make the order of the static initializer for all_mtx match the order of
fields in struct mtx.

Found by: jake


# 71320 21-Jan-2001 jasone

Remove MUTEX_DECLARE() and MTX_COLD. Instead, postpone full mutex
initialization until after malloc() is safe to call, then iterate through
all mutexes and complete their initialization.

This change is necessary in order to avoid some circular bootstrapping
dependencies.


# 71287 20-Jan-2001 jake

- Make npx_intr INTR_MPSAFE and move acquiring Giant into the
function itself.
- Remove a hack to allow acquiring Giant from the npx asm trap
vector.


# 71228 18-Jan-2001 bmilekic

Implement MTX_RECURSE flag for mtx_init().
All calls to mtx_init() for mutexes that recurse must now include
the MTX_RECURSE bit in the flag argument variable. This change is in
preparation for an upcoming (further) mutex API cleanup.
The witness code will call panic() if a lock is found to recurse but
the MTX_RECURSE bit was not set during the lock's initialization.

The old MTX_RECURSE "state" bit (in mtx_lock) has been renamed to
MTX_RECURSED, which is more appropriate given its meaning.

The following locks have been made "recursive," thus far:
eventhandler, Giant, callout, sched_lock, possibly some others declared
in the architecture-specific code, all of the network card driver locks
in pci/, as well as some other locks in dev/ stuff that I've found to
be recursive.

Reviewed by: jhb


# 70861 10-Jan-2001 jake

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.


# 69998 13-Dec-2000 jhb

- Add a new flag MTX_QUIET that can be passed to the various mtx_*
functions. If this flag is set, then no KTR log messages are issued.
This is useful for blocking excessive logging, such as with the internal
mutex used by the witness code.
- Use MTX_QUIET on all of the mtx_enter/exit operations on the internal
mutex used by the witness code.
- If we are in a panic, don't do witness checks in witness_enter(),
witness_exit(), and witness_try_enter(), just return.


# 69881 11-Dec-2000 jake

- Add code to detect if a system call returns with locks other than Giant
held and panic if so (conditional on witness).
- Change witness_list to return the number of locks held so this is easier.
- Add kern/syscalls.c to the kernel build if witness is defined so that the
panic message can contain the name of the offending system call.
- Add assertions that Giant and sched_lock are not held when returning from
a system call, which were missing for alpha and ia64.


# 69879 11-Dec-2000 jhb

Oops, the witness mutex is a spin lock, so use MTX_SPIN in the call to
mtx_init(). Since the witness code ignores its internal mutex, this
doesn't result in any functional change.


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 69429 30-Nov-2000 jhb

Split the WITNESS and MUTEX_DEBUG options apart so that WITNESS does not
depend on MUTEX_DEBUG. The MUTEX_DEBUG option turns on extra assertions
and checks to verify that mutexes themselves are implemented properly.
The WITNESS option uses extra checks and diagnostics to verify that other
code is using mutexes properly.


# 69376 29-Nov-2000 jhb

Fix up priority propagation:
- Use a better test for determining when a process is running.
- Convert some checks to assertions.
- Remove unnecessary tests.
- Save the priority before acquiring a mutex rather than in msleep(9).


# 69369 29-Nov-2000 jhb

Set p_mtxname when blocking on a mutex and clear it when waking up.


# 69363 29-Nov-2000 jhb

Use an atomic operation with an appropriate memory barrier when releasing
a contested sleep mutex in the case that at least two processes are blocked
on the contested mutex.


# 69362 29-Nov-2000 jhb

The sched_lock mutex goes after the sio mutex in the locking order since
a software interrupt can be scheduled in the sio interrupt handler while
the sio mutex is held.


# 69361 29-Nov-2000 jhb

Save the line number and filename of the last mtx_enter operation for
spin locks. We already do this for sleep locks.


# 69215 26-Nov-2000 alfred

Move the #define of _KERN_MUTEX_C_ so that it's before any system headers
are included. System headers can include sys/mutex.h and then certain
macros do not get defined.

Reviewed by: jake


# 69208 26-Nov-2000 jake

Add uidinfo hash and uidinfo struct to the witness order list.


# 68889 19-Nov-2000 jake

- Protect the callout wheel with a separate spin mutex, callout_lock.
- Use the mutex in hardclock to ensure no races between it and
softclock.
- Make softclock be INTR_MPSAFE and provide a flag,
CALLOUT_MPSAFE, which specifies that a callout handler does not
need giant. There is still no way to set this flag when
regstering a callout.

Reviewed by: -smp@, jlemon


# 68862 17-Nov-2000 jake

- Split the run queue and sleep queue linkage, so that a process
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.

Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb


# 68808 16-Nov-2000 jhb

Don't release and acquire Giant in mi_switch(). Instead, release and
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.

Submitted by: witness


# 68790 15-Nov-2000 jhb

Include the right headers to get the DDB #define and the db_active variable.


# 68786 15-Nov-2000 jhb

Declare the 'witness_spin_check' properly as a per-CPU variable in the
non-SMP case.


# 68785 15-Nov-2000 jhb

Don't perform witness checks in witness_enter() during a panic.


# 68582 10-Nov-2000 jhb

Minor whitespace nit in a comment.


# 67676 27-Oct-2000 jhb

- Use MUTEX_DECLARE() and MTX_COLD for the WITNESS code's internal mutex so
it can function before malloc(9) is up and running.
- Add two new options WITNESS_DDB and WITNESS_SKIPSPIN. If WITNESS_SKIPSPIN
is enabled, then spin mutexes are ignored by the WITNESS code. If
WITNESS_DDB is turned on and DDB is compiled into the kernel, then the
kernel will drop into DDB when either a lock hierarchy violation occurs
or mutexes are held when going to sleep.
- Add some new sysctls:
debug.witness_ddb is a read-write sysctl that corresponds to WITNESS_DDB.
The kernel option merely changes the default value to on at boot.
debug.witness_skipspin is a read-only sysctl that one can use to determine
if the kernel was compiled with WITNESS_SKIPSPIN.
- Wipe out the BSD/OS-specific lock order lists. We get to build our own
lists now as we add mutexes to the kernel.


# 67548 25-Oct-2000 jhb

Quite some warnings.


# 67404 20-Oct-2000 jhb

Propogate the 'const'ness of mutex descriptions to the witness code to
quiet warnings.


# 67401 20-Oct-2000 jhb

Actually enable the witness code if the WITNESS kernel option is enabled.


# 67396 20-Oct-2000 jhb

Doh. Fix a 64-bit-ism by using uintptr_t for a temporary lock variable
instead of int.


# 67352 20-Oct-2000 jhb

- Make the mutex code almost completely machine independent. This greatly
reducues the maintenance load for the mutex code. The only MD portions
of the mutex code are in machine/mutex.h now, which include the assembly
macros for handling mutexes as well as optionally overriding the mutex
micro-operations. For example, we use optimized micro-ops on the x86
platform #ifndef I386_CPU.
- Change the behavior of the SMP_DEBUG kernel option. In the new code,
mtx_assert() only depends on INVARIANTS, allowing other kernel developers
to have working mutex assertiions without having to include all of the
mutex debugging code. The SMP_DEBUG kernel option has been renamed to
MUTEX_DEBUG and now just controls extra mutex debugging code.
- Abolish the ugly mtx_f hack. Instead, we dynamically allocate
seperate mtx_debug structures on the fly in mtx_init, except for mutexes
that are initiated very early in the boot process. These mutexes
are declared using a special MUTEX_DECLARE() macro, and use a new
flag MTX_COLD when calling mtx_init. This is still somewhat hackish,
but it is less evil than the mtx_f filler struct, and the mtx struct is
now the same size with and without mutex debugging code.
- Add some micro-micro-operation macros for doing the actual atomic
operations on the mutex mtx_lock field to make it easier for other archs
to override/optimize mutex ops if needed. These new tiny ops also clean
up the code in some places by replacing long atomic operation function
calls that spanned 2-3 lines with a short 1-line macro call.
- Don't call mi_switch() from mtx_enter_hard() when we block while trying
to obtain a sleep mutex. Calling mi_switch() would bogusly release
Giant before switching to the next process. Instead, inline most of the
code from mi_switch() in the mtx_enter_hard() function. Note that when
we finally kill Giant we can back this out and go back to calling
mi_switch().


# 65856 14-Sep-2000 jhb

Remove the mtx_t, witness_t, and witness_blessed_t types. Instead, just
use struct mtx, struct witness, and struct witness_blessed.

Requested by: bde


# 65651 09-Sep-2000 jasone

Style cleanups. No functional changes.


# 65650 09-Sep-2000 jasone

Add file and line arguments to WITNESS_ENTER() and WITNESS_EXIT, since
__FILE__ and __LINE__ don't get expanded usefully in inline functions.

Add const to all witness*() arguments that are filenames.


# 65624 08-Sep-2000 jasone

Rename mtx_enter(), mtx_try_enter(), and mtx_exit() and wrap them with cpp
macros that expand to pass filename and line number information. This is
necessary since we're using inline functions instead of macros now.

Add const to the filename pointers passed througout the mtx and witness
code.


# 65557 06-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh