History log of /freebsd-9.3-release/sys/kern/kern_exit.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 267654 19-Jun-2014 gjb

Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 262057 17-Feb-2014 avg

MFC r258622,258675: dtrace sdt: remove the ugly sname parameter of
SDT_PROBE_DEFINE


# 260227 03-Jan-2014 jilles

MFC r258281: Fix siginfo_t.si_status for wait6/waitid/SIGCHLD.

Per POSIX, si_status should contain the value passed to exit() for
si_code==CLD_EXITED and the signal number for other si_code. This was
incorrect for CLD_EXITED and CLD_DUMPED.

This is still not fully POSIX-compliant (Austin group issue #594 says that
the full value passed to exit() shall be returned via si_status, not just
the low 8 bits) but is sufficient for a si_status-related test in libnih
(upstart, Debian/kFreeBSD).

PR: kern/184002


# 255763 21-Sep-2013 markj

MFC r252894:
Add SDT_PROBE_DEFINE0 for consistency with SDT_PROBE0.

MFC r253022:
Also define SDT_PROBE_DEFINE0 for the !KDTRACE_HOOKS case.

MFC r254266:
Add event handlers for module load and unload events. The load handlers are
called after the module has been loaded, and the unload handlers are called
before the module is unloaded. Moreover, the module unload handlers may
return an error to prevent the unload from proceeding.

MFC r254267:
Remove some unused fields from struct linker_file. They were added in
r172862 for use by the DTrace SDT framework but don't seem to have ever
been used.

MFC r254268:
FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

MFC r254309:
Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

MFC r254350:
Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.


# 247349 26-Feb-2013 jhb

MFC 240467:
Ignore stop and continue signals sent to an exiting process. Stop signals
set p_xstat to the signal that triggered the stop, but p_xstat is also
used to hold the exit status of an exiting process. Without this change,
a stop signal that arrived after a process was marked P_WEXIT but before
it was marked a zombie would overwrite the exit status with the stop signal
number.


# 247077 21-Feb-2013 kib

MFC r246484:
Allow ptrace(2) operation on the child created by vfork(2), if the
debugger is not the parent.


# 246404 06-Feb-2013 kib

MFC r246118:
The case of pid == WAIT_MYPGRP for the kern_wait() is already handled
in kern_wait6(), which is called by kern_wait(). Remove the redundand
check, introduced in r243136, and add a comment noting this, to make
the code less confusing.


# 245218 09-Jan-2013 kib

MFC r245104:
Protect the p->p_pgrp dereference with the process lock.


# 244172 13-Dec-2012 kib

MFC r242958:
Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346

MFC r243133:
Style fixes for r242958.

MFC r243134:
Alphabetically reorder the forward-declarations of the structures.
Add the declaration for enum idtype, to be used later.

MFC r243135:
Move the definition of the idtype_t from sys/types.h to sys/wait.h.
Fix the bug, use #if __BSD_VISIBLE instead of #if defined(__BSD_VISIBLE),
since __BSD_VISIBLE is always defined.
Reformat the comments from the Solaris style to KNF.

MFC r243136:
Restore the proper handling of the pid 0 for waitpid(2).
Fix the style around.


# 233920 05-Apr-2012 kib

MFC r233809:
When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.


# 233919 05-Apr-2012 kib

MFC r233808:
Add helper function to remove the process from the orphans list and
use it instead of inlined code.


# 233281 21-Mar-2012 jh

MFC r232975: Add an assert for proctree_lock to proc_to_reap().


# 233028 16-Mar-2012 kib

MFC r232947:
Lock the process around manipulations with p_flag.


# 232693 08-Mar-2012 kib

MFC r232048:
Allow the parent to gather the exit status of the children reparented
to the debugger. When reparenting for debugging, keep the child in
the new orphan list of old parent. When looping over the children in
kern_wait(), iterate over both children list and orphan list to search
for the process by pid.

MFC r232104:
Restore the return statement erronously removed in the r232048.

In order to keep stable/9 KBI, the p_dbg_child member of struct proc
was replaced with padding.


# 225736 22-Sep-2011 kensmith

Copy head to stable/9 as part of 9.0-RELEASE release cycle.

Approved by: re (implicit)


# 225641 17-Sep-2011 trasz

Fix long-standing thinko regarding maxproc accounting. Basically,
we were accounting the newly created process to its parent instead
of the child itself. This caused problems later, when the child
changed its credentials - the per-uid, per-jail etc counters were
not properly updated, because the maxproc counter in the child
process was 0.

Approved by: re (kib)


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 224987 18-Aug-2011 jonathan

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# 223825 06-Jul-2011 trasz

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 223088 14-Jun-2011 obrien

We should not return ECHILD when debugging a child and the parent does a
"wait4(-1, ..., WNOHANG, ...)". Instead wait(2) should behave as if the
child does not wish to report status at this time.

Reviewed by: jhb


# 220621 14-Apr-2011 pluknet

Remove stale M_ZOMBIE malloc type.
This type is unused since embedding p_ru into struct proc.

MFC after: 1 week


# 220526 10-Apr-2011 kib

Some callers of proc_reparent() already have the parent process locked.
Detect the situation and avoid process lock recursion.

Reported by: Fabian Keil <freebsd-listen fabiankeil de>


# 220222 31-Mar-2011 trasz

Enable accounting for RACCT_NPROC and RACCT_NTHR.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 220137 29-Mar-2011 trasz

Add racct. It's an API to keep per-process, per-jail, per-loginclass
and per-loginclass resource accounting information, to be used by the new
resource limits code. It's connected to the build, but the code that
actually calls the new functions will come later.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 215664 22-Nov-2010 netchild

By using the 32-bit Linux version of Sun's Java Development Kit 1.6
on FreeBSD (amd64), invocations of "javac" (or "java") eventually
end with the output of "Killed" and exit code 137.

This is caused by:
1. After calling exec() in multithreaded linux program threads are not
destroyed and continue running. They get killed after program being
executed finishes.

2. linux_exit_group doesn't return correct exit code when called not
from group leader. Which happens regularly using sun jvm.

The submitters fix this in a similar way to how NetBSD handles this.

I took the PRs away from dchagin, who seems to be out of touch of
this since a while (no response from him).

The patches committed here are from [2], with some little modifications
from me to the style.

PR: 141439 [1], 144194 [2]
Submitted by: Stefan Schmidt <stefan.schmidt@stadtbuch.de>, gk
Reviewed by: rdivacky (in april 2010)
MFC after: 5 days


# 214158 21-Oct-2010 jhb

- When disabling ktracing on a process, free any pending requests that
may be left. This fixes a memory leak that can occur when tracing is
disabled on a process via disabling tracing of a specific file (or if
an I/O error occurs with the tracefile) if the process's next system
call is exit(). The trace disabling code clears p_traceflag, so exit1()
doesn't do any KTRACE-related cleanup leading to the leak. I chose to
make the free'ing of pending records synchronous rather than patching
exit1().
- Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into
kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a
result.

MFC after: 1 month


# 213642 09-Oct-2010 davidxu

Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian


# 211616 22-Aug-2010 rpaulo

Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwaston [1]


# 209592 29-Jun-2010 jhb

Tweak the in-kernel API for sending signals to threads:
- Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c.
- Add tdsignal() and tdksignal() routines that mirror psignal() and
pksignal() except that they accept a thread as an argument instead of
a process. They send a signal to a specific thread rather than to an
individual process.

Reviewed by: kib


# 200732 19-Dec-2009 ed

Let access overriding to TTYs depend on the cdev_priv, not the vnode.

Basically this commit changes two things, which improves access to TTYs
in exceptional conditions. Basically the problem was that when you ran
jexec(8) to attach to a jail, you couldn't use /dev/tty (well, also the
node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if
you want to attach to screens quickly, use ssh(1), etc.

The fixes:

- Cache the cdev_priv of the controlling TTY in struct session. Change
devfs_access() to compare against the cdev_priv instead of the vnode.
This allows you to bypass UNIX permissions, even across different
mounts of devfs.

- Extend devfs_prison_check() to unconditionally expose the device node
of the controlling TTY, even if normal prison nesting rules normally
don't allow this. This actually allows you to interact with this
device node.

To be honest, I'm not really happy with this solution. We now have to
store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp).
In an ideal world, we should just get rid of the latter two and only use
s_ttyp, but this makes certian pieces of code very impractical (e.g.
devfs, kern_exit.c).

Reported by: Many people


# 197942 10-Oct-2009 kib

Refine r195509, instead of checking that vnode type is VBAD, that is
set quite late in the revocation path, properly verify that vnode is
not doomed before calling VOP.

Reported and tested by: Harald Schmalzbauer <h.schmalzbauer omnilan de>
MFC after: 3 days


# 196567 26-Aug-2009 marius

Add a temporary workaround which just lets init die instead of
causing a panic if it is killed due to a unsolved stack overflow
seen very late during shutdown on sparc64 when the gmirror worker
process exists, which is a regression introduced in 8.0.

Reviewed by: kib
MFC after: 3 days


# 195741 17-Jul-2009 jamie

Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by: re (kib), bz (mentor)


# 195509 09-Jul-2009 kib

The control terminal revocation at the session leader exit does not
correctly checks for reclaimed vnode, possibly calling VOP_REVOKE for
such vnode. If the terminal is already revoked, or devfs mount was
forcibly unmounted, the revocation of doomed ctty vnode causes panic.

Reported and tested by: lstewart
Approved by: re (kensmith)
MFC after: 2 weeks


# 195235 01-Jul-2009 rwatson

udit the 'options' argument to wait4(2).

Approved by: re (kib)
MFC after: 3 days


# 195104 27-Jun-2009 rwatson

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 194264 15-Jun-2009 ed

Perform some more cleanups to in-kernel session handling.

The code that was in place in exit1() was mainly based on code from the
old TTY layer. The main reason behind this, was because at one moment I
ran a system that had two TTY layers in place at the same time. It is
now sufficient to do the following:

- Remove references from the session structure to the TTY vnode and the
session leader.

- If we have a controlling TTY and the session used by the TTY is equal
to our session, send the SIGHUP.

- If we have a vnode to the controlling TTY which has not been revoked,
revoke it.

While there, change sys/kern/tty.c to use s_ttyp in the comparison
instead of s_ttyvp. It should not make any difference, because s_ttyvp
can only become null when the session leader already left, but it's
nicer to compare against the proper value.


# 194256 15-Jun-2009 ed

Make tcsetsid(3) work on revoked TTYs.

Right now the only way to make tcsetsid(3)/TIOCSCTTY work, is by
ensuring the session leader is dead. This means that an application that
catches SIGHUPs and performs a sleep prevents us from assigning a new
session leader.

Change the code to make it work on revoked TTYs as well. This allows us
to change init(8) to make the shutdown script run in a more clean
environment.


# 193723 08-Jun-2009 rwatson

Move zombie-reaping code out of kern_wait() and into its own function,
proc_reap().
Reviewed by: jhb
MFC after: 3 days
Sponsored by: Google, Inc.


# 193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 191915 08-May-2009 zec

Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by: bz
Approved by: julian (mentor)


# 191319 20-Apr-2009 kib

Fix typo.

Noted by: jhb
MFC after: 2 weeks


# 191313 20-Apr-2009 kib

On the exit of the child process which parent either set SA_NOCLDWAIT
or ignored SIGCHLD, unconditionally wake up the parent instead of doing
this only when the child is a last child.

This brings us in line with other U**xes that support SA_NOCLDWAIT. If
the parent called waitpid(childpid), then exit of the child should wake
up the parent immediately instead of forcing it to wait for all children
to exit.

Reported by: Alan Ferrency <alan pair com>
Submitted by: Jilles Tjoelker <jilles stack nl>
PR: 108390
MFC after: 2 weeks


# 189074 26-Feb-2009 ed

Remove even more unneeded variable assignments.

kern_time.c:
- Unused variable `p'.

kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
initialize it. There is no way that error != 0 at the end of
create_thread().

kern_sig.c:
- Unused variable `code'.

kern_synch.c:
- `rval' is always assigned in all different cases.

kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.

kern_malloc.c:
- `size' is always initialized with the proper value before being used.

kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
returns a non-zero value.

kern_exec.c:
- `len' is always assigned inside the if-statement right below it.

tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().

Found by: LLVM's scan-build


# 185647 05-Dec-2008 kib

Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week


# 185435 29-Nov-2008 bz

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# 183911 15-Oct-2008 davidxu

Move per-thread userland debugging flags into seperated field,
this eliminates some problems of locking, e.g, a thread lock is needed
but can not be used at that time. Only the process lock is needed now
for new field.


# 182424 28-Aug-2008 davidxu

Don't remove queued SIGCHLD if options contain WNOWAIT, so other
threads still can be notified by the signal.


# 182193 26-Aug-2008 kib

Implement WNOWAIT flag for wait4(2). It specifies that process whose status
is returned shall be kept in the waitable state.
Add WSTOPPED as an alias for WUNTRACED.

Submitted by: Jukka Ukkonen <jau at iki fi>
PR: standards/116221
MFC after: 2 weeks


# 181905 20-Aug-2008 ed

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 179276 24-May-2008 jb

Add DTrace 'proc' provider probes using the Statically Defined Trace
(sdt) mechanism.


# 177503 22-Mar-2008 phk

In abort2(2): Accept a NULL arg pointer if nargs == 0


# 177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 176365 17-Feb-2008 kris

Switch from conditionally dropping Giant in exit1() to asserting it is
not held, which appears to be always true.


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 173004 26-Oct-2007 julian

Introduce a way to make pure kernal threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls return, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

fix top to show kernel threads by their thread name in -SH mode
add a tdnam formatting option to ps to show thread names.

make all idle threads actual kthreads and put them into their own idled process.
make all interrupt threads kthreads and put them in an interd process
(mainly for aesthetic and accounting reasons)
rename proc 0 to be 'kernel' and it's swapper thread is now 'swapper'

man page fixes to follow.


# 172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 170685 13-Jun-2007 jhb

Improve the ktrace locking somewhat to reduce overhead:
- Depessimize userret() in kernels where KTRACE is enabled by doing an
unlocked check of the per-process queue of pending events before
acquiring any locks. Previously ktr_userret() unconditionally acquired
the global ktrace_sx lock on every return to userland for every thread,
even if ktrace wasn't enabled for the thread.
- Optimize the locking in exit() to first perform an unlocked read of
p_traceflag to see if ktrace is enabled and only acquire locks and
teardown ktrace if the test succeeds. Also, explicitly disable tracing
before draining any pending events so the pending events actually get
written out. The unlocked read is safe because proc lock is acquired
earlier after single-threading so p_traceflag can't change between then
and this check (well, it can currently due to a bug in ktrace I will fix
next, but that race existed prior to this change as well).

Reviewed by: rwatson


# 170472 09-Jun-2007 attilio

rufetch and calcru sometimes should be called atomically together.
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.

Reviewed by: jeff
Approved by: jeff (mentor)


# 170466 09-Jun-2007 attilio

The current rusage code show peculiar problems:
- Unsafeness on ruadd() in thread_exit()
- Unatomicity of thread_exiit() in the exit1() operations

This patch addresses these problems allocating p_fd as part of the
process and modifying the way it is accessed.

A small chunk of this patch, resolves a race about p_state in kern_wait(),
since we have to be sure about the zombif-ing process.

Submitted by: jeff
Approved by: jeff (mentor)


# 170407 07-Jun-2007 rwatson

Move per-process audit state from a pointer in the proc structure to
embedded storage in struct ucred. This allows audit state to be cached
with the thread, avoiding locking operations with each system call, and
makes it available in asynchronous execution contexts, such as deep in
the network stack or VFS.

Reviewed by: csjp
Approved by: re (kensmith)
Obtained from: TrustedBSD Project


# 170307 04-Jun-2007 jeff

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170174 31-May-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 169564 14-May-2007 jhb

Move cpu_exit() earlier in exit1() to close a race between
SIGCHLD/kevent(2) notification of process termination and wait(). Now
we no longer drop locks between sending the notification and marking
the process as a zombie. Previously, if another process attempted to do
a wait() with W_NOHANG after receiving a SIGCHLD or kevent and locked
the process while the exiting thread was in cpu_exit(), then wait() would
fail to find the process, which is quite astonishing to the process
calling wait().

MFC after: 3 days


# 167787 21-Mar-2007 jhb

Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes,
rwlocks, and sx locks to 'lock_object'.


# 167232 05-Mar-2007 rwatson

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 167211 04-Mar-2007 rwatson

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 163676 25-Oct-2006 davidxu

Move sigqueue_take() call into proc_reparent(), this fixed bugs where
proc_reparent() is called but sigqueue_take() is forgotten.


# 163653 24-Oct-2006 davidxu

Protect sigqueue_take() call by child process's lock, it fixed a
potential race with ptrace 'attach' which changes parent of the
child process.


# 163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 163600 21-Oct-2006 davidxu

Since revision 1.333 of kern_sig.c no longer uses P_WEXIT, the change
opened a race window which can cause memory leak in signal queue.
Here we free memory for signal queue when process state is set to
PRS_ZOMBIE.


# 162284 13-Sep-2006 csjp

Back out one of the Giant removals from revision 1.272. Giant was not here to
protect the vnode, it was present to synchronize access to TTY session
information between exit(2) and the TTY code. While we are here, note that
Giant is required for TTY protection.

Clue from: bde
Discussed with: jhb
MFC after: 1 week


# 159054 29-May-2006 tegge

Close race between vmspace_exitfree() and exit1() and races between
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.

Factor out part of exit1() into new function vmspace_exit(). Attach
to vmspace0 to allow old vmspace to be freed earlier.

Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process. Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.

Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().

Reviewed by: alc


# 157632 10-Apr-2006 csjp

Kill the last Giant acquisition in the exit(2) code. This Giant acquisition
doesn't appear to be protecting anything. Most of consumers funsetownlst(9)
do not appear to be picking up Giant anywhere. This was originally a part
of my Giant exit(2) clean up revision 1.272 but I thought it was a good idea
to leave it out until we were able to analyze it better.

Tested by: kris
MFC after: 3 weeks


# 157443 03-Apr-2006 peter

Remove the unused sva and eva arguments from pmap_remove_pages().


# 156705 14-Mar-2006 davidxu

1. Count last time slice, this intends to fix
"calcru: runtime went backwards" bug for threaded process.
2. Add comment about possible logical problem with scheduler.

MFC after: 3 days


# 155922 22-Feb-2006 jhb

Close some races between procfs/ptrace and exit(2):
- Reorder the events in exit(2) slightly so that we trigger the S_EXIT
stop event earlier. After we have signalled that, we set P_WEXIT and
then wait for any processes with a hold on the vmspace via PHOLD to
release it. PHOLD now KASSERT()'s that P_WEXIT is clear when it is
invoked, and PRELE now does a wakeup if P_WEXIT is set and p_lock drops
to zero.
- Change proc_rwmem() to require that the processing read from has its
vmspace held via PHOLD by the caller and get rid of all the junk to
screw around with the vmspace reference count as we no longer need it.
- In ptrace() and pseudofs(), treat a process with P_WEXIT set as if it
doesn't exist.
- Only do one PHOLD in kern_ptrace() now, and do it earlier so it covers
FIX_SSTEP() (since on alpha at least this can end up calling proc_rwmem()
to clear an earlier single-step simualted via a breakpoint). We only
do one to avoid races. Also, by making the EINVAL error for unknown
requests be part of the default: case in the switch, the various
switch cases can now just break out to return which removes a _lot_ of
duplicated PRELE and proc unlocks, etc. Also, it fixes at least one bug
where a LWP ptrace command could return EINVAL with the proc lock still
held.
- Changed the locking for ptrace_single_step(), ptrace_set_pc(), and
ptrace_clear_single_step() to always be called with the proc lock
held (it was a mixed bag previously). Alpha and arm have to drop
the lock while the mess around with breakpoints, but other archs
avoid extra lock release/acquires in ptrace(). I did have to fix a
couple of other consumers in kern_kse and a few other places to
hold the proc lock and PHOLD.

Tested by: ps (1 mostly, but some bits of 2-4 as well)
MFC after: 1 week


# 155883 21-Feb-2006 jhb

Move the ruadd() in kern_exit() to save our final stats in our child
stats even further down in exit1() so that it includes the runtime and
tick counts from the final time slice for the dying thread.

Reviewed by: phk


# 155534 11-Feb-2006 phk

CPU time accounting speedup (step 2)

Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc when at context switch.

Don't reach across CPUs in calcru().

Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).

Don't enforce monotonicity (at least for now) in calcru. While the
calibrated cpu_tickrate ramps up it may not be true.

Use 27MHz counter on i386/Geode.

Use TSC on amd64 & i386 if present.

Use tick counter on sparc64


# 155444 07-Feb-2006 phk

Modify the way we account for CPU time spent (step 1)

Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.


# 155400 06-Feb-2006 jhb

- Move the wakeup() for exiting kthreads out of exit1() and into
kthread_exit() as that is cleaner and less obscured. It also does the
wakeup sooner.
- Add some comments to kthread_exit().


# 155368 05-Feb-2006 wsalamon

Audit the pid being requested in wait4().

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# 155354 05-Feb-2006 rwatson

On process exit, audit the return value of the process, and commit the
record immediately, as this system call never returns.

Obtained from: TrustedBSD Project


# 155267 03-Feb-2006 jhb

Add a comment.


# 155198 01-Feb-2006 rwatson

Hook up audit to fork() and exit() events. These changes manage the
audit state on processes, not auditing of these events.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 154731 23-Jan-2006 ups

Hopefully fix the "calcru: runtime went backwards from ..." problem by
keeping the resource values locked (where needed) while we use them
for calculations.

MFC after: 3 days


# 153681 23-Dec-2005 phk

Regenerate sysent with new abort2 system call.

Implement abort2(const char *reason, int narg, void **args);

Submitted by: "Wojciech A. Koszek" <dunstan@freebsd.czest.pl>


# 153259 09-Dec-2005 davidxu

Register itimers_event_hook as a kernel event handler, so I don't
have to duplicate code to call it in exec() and exit1().


# 152376 13-Nov-2005 rwatson

Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.

- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
process.

- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.

- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.

MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>


# 152207 08-Nov-2005 csjp

Giant clean up for exit(2)

-Change unconditional aquisition of Giant to only pickup Giant if the vnode
for the controlling tty resides on a non-mpsafe file system.
-Pickup Giant around executable vnode reference counting operations only if
the executable resides on a non-mpsafe file system.
-If this process is being traced, pickup Giant for trace file reference count
operations only if it resides on a non-mpsafe file system.

Discussed with: jhb
Tested by: kris


# 152185 08-Nov-2005 davidxu

Add support for queueing SIGCHLD same as other UNIX systems did.

For each child process whose status has been changed, a SIGCHLD instance
is queued, if the signal is stilling pending, and process changed status
several times, signal information is updated to reflect latest process
status. If wait() returns because the status of a child process is
available, pending SIGCHLD signal associated with the child process is
discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time when proc structure
is allocated, if process signal queue is fully filled or there is a memory
shortage, it can still send the signal to process.

There is a booting time tunable kern.sigqueue.queue_sigchild which
can control the behavior, setting it to zero disables the SIGCHLD queueing
feature, the tunable will be removed if the function is proved that it is
stable enough.

Tested on: i386 (SMP and UP)


# 151932 01-Nov-2005 jhb

Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by: csjp
MFC after: 1 week


# 151695 26-Oct-2005 glebius

- Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
difficult to understand.

Reported by: Vladimir Sharun <sharun ukr.net>


# 151585 23-Oct-2005 davidxu

Make p_itimers as a pointer, so file sys/proc.h does not need to include
sys/timers.h.


# 151576 23-Oct-2005 davidxu

Implement POSIX timers. Current only CLOCK_REALTIME and CLOCK_MONOTONIC
clock are supported. I have plan to merge XSI timer ITIMER_REAL and other
two CPU timers into the new code, current three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks of two member fields:
sigev_notify_function
sigev_notify_attributes
I have found the sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.


# 151316 14-Oct-2005 davidxu

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 149736 02-Sep-2005 jhb

Add witness warnings to panic if a thread tries to exit while holding any
locks.

Requested by: jeff
MFC after: 3 days


# 148137 18-Jul-2005 jhb

- Slightly reorder the events around the setting of PRS_ZOMBIE to be less
hokie and much more readable and expand the comment to explain why it is
the way that it is.
- Close a race where one CPU could free the process belonging to a thread
on another CPU that hasn't quite finished exiting yet but is beyond the
point of setting the process state as PRS_ZOMBIE.

Reported and tested by: ps (2)
MFC after: 3 days


# 146554 23-May-2005 ups

Use low level constructs borrowed from interrupt threads to wait for
work in proc0.
Remove the TDP_WAKEPROC0 workaround.


# 145899 05-May-2005 davidxu

Only check signal event, single threading event shouldn't be reported.


# 145439 23-Apr-2005 davidxu

Wake up swapper process if needed.

PR: kern/78474
Submitted by: Sam Lawrance <boris at brooknet dot com dot au>


# 145272 19-Apr-2005 davidxu

Clear P_STATCHILD earlier to avoid unnecessary retrying.


# 145260 19-Apr-2005 davidxu

Fix a race condition between kern_wait() and thread_stopped().
Problem is in kern_wait(), parent process steps through children list,
once a child process is skipped, and later even if the child is stopped,
parent process still sleeps in msleep(), the race happens if parent
masked SIGCHLD.

Submitted by : Peter Edwards peadar.edwards at gmail dot com
MFC after : 4 days


# 145234 18-Apr-2005 rwatson

Introduce p_canwait() and MAC Framework and MAC Policy entry points
mac_check_proc_wait(), which control the ability to wait4() specific
processes. This permits MAC policies to limit information flow from
children that have changed label, although has to be handled carefully
due to common programming expectations regarding the behavior of
wait4(). The cr_seeotheruids() check in p_canwait() is #if 0'd for
this reason.

The mac_stub and mac_test policies are updated to reflect these new
entry points.

Sponsored by: SPAWAR, SPARTA
Obtained from: TrustedBSD Project


# 143496 13-Mar-2005 jeff

- A lock is required before calling VOP_REVOKE. Our reference protects us
from accessing another vnode so a naked VOP_LOCK is sufficient.

Sponsored by: Isilon Systems, Inc.


# 140961 29-Jan-2005 phk

In 1.276 of kern/subr_trap.c I introduced a mechanism for delaying
a process return to userspace if it had pending GEOM events.

We need to have the same check in the exit pass to catch the case
where a GEOM related filedescriptor is not explicitly closed by
the process.

Bumped into by: people using dd(1) to build releases, nanobsd etc.


# 139893 08-Jan-2005 rwatson

In kern_wait(), let the compiler copy the rusage structure rather than
an explicit bcopy() -- it probably does a better job.


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 139739 05-Jan-2005 jhb

- Move the function prototypes for kern_setrlimit() and kern_wait() to
sys/syscallsubr.h where all the other kern_foo() prototypes live.
- Resort kern_execve() while I'm there.


# 138129 27-Nov-2004 das

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 136807 23-Oct-2004 davidxu

Remove P_STOPPED_TRACE bit if debugger dies without a chance to
detach debugged process.


# 136152 05-Oct-2004 jhb

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# 135763 24-Sep-2004 jhb

Some more whitespace, style, and comment fixes.

Submitted by: bde (mostly)


# 135688 23-Sep-2004 jhb

A modest collection of various and sundry style, spelling, and whitespace
fixes.

Submitted by: bde (mostly)


# 135573 22-Sep-2004 jhb

Various small style fixes.


# 134791 05-Sep-2004 julian

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 132898 30-Jul-2004 alc

Giant is no longer required by vm_waitproc() and vmspace_exitfree().
Eliminate it acquisition and release around vm_waitproc() in kern_wait().


# 132684 27-Jul-2004 alc

- Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.


# 132372 18-Jul-2004 julian

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 132087 13-Jul-2004 davidxu

Add code to support debugging threaded process.

1. Add tm_lwpid into kse_thr_mailbox to indicate which kernel
thread current user thread is running on. Add tm_dflags into
kse_thr_mailbox, the flags is written by debugger, it tells
UTS and kernel what should be done when the process is being
debugged, current, there two flags TMDF_SSTEP and TMDF_DONOTRUNUSER.

TMDF_SSTEP is used to tell kernel to turn on single stepping,
or turn off if it is not set.

TMDF_DONOTRUNUSER is used to tell kernel to schedule upcall
whenever possible, to UTS, it means do not run the user thread
until debugger clears it, this behaviour is necessary because
gdb wants to resume only one thread when the thread's pc is
at a breakpoint, and thread needs to go forward, in order to
avoid other threads sneak pass the breakpoints, it needs to remove
breakpoint, only wants one thread to go. Also, add km_lwp to
kse_mailbox, the lwp id is copied to kse_thr_mailbox at context
switch time when process is not being debugged, so when process
is attached, debugger can map kernel thread to user thread.

2. Add p_xthread to proc strcuture and td_xsig to thread structure.
p_xthread is used by a thread when it wants to report event
to debugger, every thread can set the pointer, especially, when
it is used in ptracestop, it is the last thread reporting event
will win the race. Every thread has a td_xsig to exchange signal
with debugger, thread uses TDF_XSIG flag to indicate it is reporting
signal to debugger, if the flag is not cleared, thread will keep
retrying until it is cleared by debugger, p_xthread may be
used by debugger to indicate CURRENT thread. The p_xstat is still
in proc structure to keep wait() to work, in future, we may
just use td_xsig.

3. Add TDF_DBSUSPEND flag, the flag is used by debugger to suspend
a thread. When process stops, debugger can set the flag for
thread, thread will check the flag in thread_suspend_check,
enters a loop, unless it is cleared by debugger, process is
detached or process is existing. The flag is also checked in
ptracestop, so debugger can temporarily suspend a thread even
if the thread wants to exchange signal.

4. Current, in ptrace, we always resume all threads, but if a thread
has already a TDF_DBSUSPEND flag set by debugger, it won't run.

Encouraged by: marcel, julian, deischen


# 132082 13-Jul-2004 alc

Push down the acquisition and release of the page queues lock into
pmap_remove_pages(). (The implementation of pmap_remove_pages() is
optional. If pmap_remove_pages() is unimplemented, the acquisition and
release of the page queues lock is unnecessary.)

Remove spl calls from the alpha, arm, and ia64 pmap_remove_pages().


# 132016 12-Jul-2004 marcel

Implement the PT_LWPINFO request. This request can be used by the
tracing process to obtain information about the LWP that caused the
traced process to stop. Debuggers can use this information to select
the thread currently running on the LWP as the current thread.

The request has been made compatible with NetBSD for as much as
possible. This implementation differs from NetBSD in the following
ways:
1. The data argument is allowed to be smaller than the size of the
ptrace_lwpinfo structure known to the kernel, but not 0. This
is opposite to what NetBSD allows. The reason for this is that
we can extend the structure without affecting older binaries.
2. On NetBSD the tracing process is to set the pl_lwpid field to
the Id of the LWP it wants information of. We don't do that.
Our ptrace interface allows passing the LWP Id instead of the
PID. The tracing process is to set the PID to the LWP Id it
wants information of.
3. When the PID is actually the PID of the tracing process, this
request returns the information about the LWP that caused the
process to stop. This was the whole purpose of the request in
the first place.

When the traced process has exited, this request will return the
LWP Id 0, indicating that the process state is not the result of
an event specific to a LWP.


# 130847 21-Jun-2004 bde

(1) Removed the bogus condition "p->p_pid != 1" on calling sched_exit()
from exit1(). sched_exit() must be called unconditionally from exit1().
It was called almost unconditionally because the only exits on system
shutdown if at all.

(2) Removed the comment that presumed to know what sched_exit() does.
sched_exit() does different things for the ULE case. The call became
essential when it started doing load average stuff, but its caller
should not know that.

(3) Didn't fix bugs caused by bitrot in the condition. The condition was
last correct in rev.1.208 when it was in wait1(). There p was spelled
curthread->td_proc and was for the waiting parent; now p is for the
exiting child. The condition was to avoid lowering init's priority.
It should be in sched_exit() itself. Lowering of priorities is broken
in other ways in at least the 4BSD scheduler, and doing it for init
causes less noticeable problems than doing it for for shells.

Noticed by: julian (1)


# 130842 21-Jun-2004 bde

Update p_runtime on exit. This fixes calcru() on zombies, and prepares
for not calling calcru() on exit. calcru() on a zombie can happen if
ttyinfo() (^T) picks one.

PR: 52490


# 130684 18-Jun-2004 davidxu

Add comment to reflect that we should retry after thread singling failed.


# 130675 18-Jun-2004 davidxu

Remove a bogus panic. It is possible more than one threads will
be suspended in thread_suspend_check, after they are resumed, all
threads will call thread_single, but only one can be success,
others should retry and will exit in thread_suspend_check.


# 130239 08-Jun-2004 tjr

Remove remnants of PGINPROF.


# 129989 02-Jun-2004 tjr

Move TDF_SA from td_flags to td_pflags (and rename it accordingly)
so that it is no longer necessary to hold sched_lock while
manipulating it.

Reviewed by: davidxu


# 129750 26-May-2004 tmm

Retire cpu_sched_exit(); it is not used any more.


# 129547 21-May-2004 davidxu

Clear KSE thread flags after KSE thread mode is ended. The side effect
of not clearing the flags for execv() syscall will result that a new
program runs in KSE thread mode without enabling it.

Submitted by: tjr
Modified by: davidxu


# 129074 09-May-2004 julian

Remove misplaced duplicate comment and slightly reformat the
version that was in the right place.


# 127911 05-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127181 18-Mar-2004 green

Add the missing Giant when doing anything with VFS -- in this case,
releasing the ktrace vnode.


# 127140 17-Mar-2004 jhb

- Replace wait1() with a kern_wait() function that accepts the pid,
options, status pointer and rusage pointer as arguments. It is up to
the caller to copyout the status and rusage to userland if needed. This
lets us axe the 'compat' argument and hide all that functionality in
owait(), by the way. This also cleans up some locking in kern_wait()
since it no longer has to drop locks around copyout() since all the
copyout()'s are deferred.
- Convert owait(), wait4(), and the various ABI compat wait() syscalls to
use kern_wait() rather than wait1() or wait4(). This removes a bit
more stackgap usage.

Tested on: i386
Compiled on: i386, alpha, amd64


# 126941 14-Mar-2004 peter

Make the process_exit eventhandler run without Giant. Add Giant hooks
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.


# 126932 13-Mar-2004 peter

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 126672 05-Mar-2004 jhb

- Push down Giant in exit() and wait().
- Push Giant down a bit in coredump() and call coredump() with the proc
lock already held rather than unlocking it only to turn around and
relock it.

Requested by: peter


# 126325 27-Feb-2004 jhb

Drop sched_lock around the wakeup of the parent process after setting
the process state to zombie when a process exits to avoid a lock order
reversal with the sleepqueue locks. This appears to be the only place
that we call wakeup() with sched_lock held.


# 125988 19-Feb-2004 truckman

A Linux thread created using clone() should not send SIGCHLD to its
parent if no signal is specified in the clone() flags argument.

PR: 42457
MFC after: 2 weeks


# 125720 11-Feb-2004 truckman

When reparenting a process to init, make sure that p_sigparent is
set to SIGCHLD. This avoids the creation of orphaned Linux-threaded
zombies that init is unable to reap. This can occur when the parent
process sets its SIGCHLD to SIG_IGN. Fix a similar situation in the
PT_DETACH code.

Tested by: "Steven Hartland" <killing AT multiplay.co.uk>


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 124802 21-Jan-2004 rwatson

Reduce gratuitous includes: don't include jail.h if it's not needed.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.

Result of: self injury involving adding something to struct prison


# 122686 14-Nov-2003 cognet

Better fix than my previous commit:
in exit1(), make sure the p_klist is empty after sending NOTE_EXIT.
The process won't report fork() or execve() and won't be able to handle
NOTE_SIGNAL knotes anyway.
This fixes some race conditions with do_tdsignal() calling knote() while
the process is exiting.

Reported by: Stefan Farfeleder <stefan@fafoe.narf.at>
MFC after: 1 week


# 116361 14-Jun-2003 davidxu

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 116123 09-Jun-2003 jhb

Wait for the real interval timer callout handler to finish executing if it
is currently executing when we try to remove it in exit1(). Without this,
it was possible for the callout to bogusly rearm itself and eventually
refire after the process had been free'd resulting in a panic.

PR: kern/51964
Reported by: Jilles Tjoelker <jilles@stack.nl>
Reviewed by: tegge, bde


# 114983 13-May-2003 jhb

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 114461 01-May-2003 jhb

Initialize and destroy the struct proc mutex in the proc zone's init and
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.


# 113920 23-Apr-2003 jhb

- Protect p_numthreads with the sched_lock.
- Protect p_singlethread with both the sched_lock and the proc lock.
- Protect p_suspcount with the proc lock.


# 113795 21-Apr-2003 davidxu

Fix lock order reversal problem.


# 113627 17-Apr-2003 jhb

Adjust a few comments.


# 113355 11-Apr-2003 jeff

- Adjust sched hooks for fork and exec to take processes as arguments instead
of ksegs since they primarily operation on processes.
- KSEs take ticks so pass the kse through sched_clock().
- Add a sched_class() routine that adjusts a ksegrp pri class.
- Define a sched_fork_{kse,thread,ksegrp} and sched_exit_{kse,thread,ksegrp}
that will be used to tell the scheduler about new instances of these
structures within the same process. These will be used by THR and KSE.
- Change sched_4bsd to reflect this API update.


# 112910 31-Mar-2003 jeff

- Borrow the KSE single threading code for exec and exit. We use the check
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads. The rest of the system has no need to
identify thr threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.


# 112888 31-Mar-2003 jeff

- Move p->p_sigmask to td->td_sigmask. Signal masks will be per thread with
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.


# 112564 24-Mar-2003 jhb

Replace the at_fork, at_exec, and at_exit functions with the slightly more
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.

Reviewed by: phk, jake, almost complete silence on arch@


# 112391 18-Mar-2003 des

Unregisterize, ansify.


# 112389 18-Mar-2003 des

Whitespace cleanup.


# 112198 13-Mar-2003 jhb

- Cache a reference to the credential of the thread that starts a ktrace in
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.

Requested by: rwatson (1)


# 112170 12-Mar-2003 tjr

Tidy up previous change: move comment about obtaining an exclusive
reference where it belongs, and remove a blank line to make it more
obvious what the comment applies to.


# 112139 12-Mar-2003 tjr

In wait1(), remove the zombie process from zombproc before removing
it from its pgrp to avoid leaving zombies around with p_pgrp == NULL.
This bug was apparent as a NULL-dereference in the pid selection code
in fork1().


# 112079 11-Mar-2003 davidxu

This is a force-commit for:
kern_sig.c 1.215
kern_thread.c 1.103
kern_exit.c 1.199
proc.h 1.302

Orignal code would suspend an already suspended thread,
if user presses ^Z while a threaded program is running. Also
there is a race between job control and thread_exit(), the
new code tests job control requesting before thread exits,
in wait() syscall, be sure to check child process is fully
stopped, this avoids a later SIGCHILD and returns STOPPED
status twice for a threaded child proc. A thread_stopped()
function is added for common code in several places.


# 112071 10-Mar-2003 davidxu

Fix threaded process job control bug. SMP tested.

Reviewed by: julian


# 111585 27-Feb-2003 julian

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 111034 17-Feb-2003 tjr

Use the proc lock to protect p_realtimer instead of Giant, and obtain
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much the
KSE code.

Submitted by: davidxu


# 110906 15-Feb-2003 alfred

Fix LOR with PROC/filedesc. Introduce fdesc_mtx that will be used as a
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.


# 110530 08-Feb-2003 julian

A little infrastructure, preceding some upcoming changes
to the profiling and statistics code.

Submitted by: DavidXu@
Reviewed by: peter@


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109877 26-Jan-2003 davidxu

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109205 13-Jan-2003 dillon

It is possible for an active aio to prevent shared memory from being
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.

Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().

MFC after: 7 days


# 107912 15-Dec-2002 dillon

Fix a refcount race with the vmspace structure. In order to prevent
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.

This commit solves the problem by introducing a secondary reference count,
calling 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.

MFC after: 3 weeks


# 107719 10-Dec-2002 julian

Unbreak the KSE code. Keep track of zobie threads using the Per-CPU storage
during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 107216 25-Nov-2002 alc

Acquire and release the page queues lock around pmap_remove_pages() because
it updates several of vm_page's fields.


# 107105 20-Nov-2002 rwatson

Introduce p_label, extensible security label storage for the MAC framework
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc) can update
process properties even when many existing locks are held without
violating the lock order. No currently committed policies implement use
of this label storage.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 105141 14-Oct-2002 jhb

- Add a new global mutex 'ppeers_lock' to protect the p_peers list of
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.

Discussed with: truckman (earlier versions)


# 104964 12-Oct-2002 jeff

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 104695 09-Oct-2002 julian

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 104390 02-Oct-2002 julian

Whitespace fix only


# 104306 01-Oct-2002 jmallett

Back our kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


# 104245 30-Sep-2002 jmallett

(Forced commit, to clarify previous commit of ksiginfo/signal queue code.)

I've added a structure, kernel-private, to represent a pending or in-delivery
signal, called `ksiginfo'. It is roughly analogous to the basic information
that is exported by the POSIX interface 'siginfo_t', but more basic. I've
added functions to allocate these structures, and further to wrap all signal
operations using them.

Once the operations are wrapped, I've added a TailQ (see queue(3)) of these
structures to 'struct proc', and all pending signals are in that TailQ. When
a signal is being delivered, it is dequeued from the list. Once I finish
the spreading of ksiginfo throughout the tree, the dequeued structure will be
delivered to the process in question, whereas currently and normally, the
signal number is what is used.


# 104233 30-Sep-2002 jmallett

First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]


# 103767 21-Sep-2002 jake

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# 103410 16-Sep-2002 mini

Add kernel support needed for the KSE-aware libpthread:
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.

Reviewed by: bde, deischen, julian
Approved by: -arch


# 103367 15-Sep-2002 julian

Allocate KSEs and KSEGRPs separatly and remove them from the proc structure.
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 103002 06-Sep-2002 julian

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures* .

i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


# 102950 05-Sep-2002 davidxu

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.

Approved by: julian (mentor)


# 101156 01-Aug-2002 jhb

Revert previous revision which accidentally snuck in with another commit.
It just removed a comment that doesn't make sense to me personally.


# 101153 01-Aug-2002 jhb

If we fail to write to a vnode during a ktrace write, then we drop all
other references to that vnode as a trace vnode in other processes as well
as in any pending requests on the todo list. Thus, it is possible for a
ktrace request structure to have a NULL ktr_vp when it is destroyed in
ktr_freerequest(). We shouldn't call vrele() on the vnode in that case.

Reported by: bde


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 99009 28-Jun-2002 alfred

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 98765 24-Jun-2002 jake

Add an MD callout like cpu_exit, but which is called after sched_lock is
obtained, when all other scheduling activity is suspended. This is needed
on sparc64 to deactivate the vmspace of the exiting process on all cpus.
Otherwise if another unrelated process gets the exact same vmspace structure
allocated to it (same address), its address space will not be activated
properly. This seems to fix some spontaneous signal 11 problems with smp
on sparc64.


# 97997 07-Jun-2002 jhb

Properly lock accesses to p_tracep and p_traceflag. Also make a few
ktrace-only things #ifdef KTRACE that were not before.


# 97714 01-Jun-2002 mike

Add POSIX.1-2001 WCONTINUED option for waitpid(2). A proc flag
(P_CONTINUED) is set when a stopped process receives a SIGCONT and
cleared after it has notified a parent process that has requested
notification via waitpid(2) with WCONTINUED specified in its options
operand. The status value can be checked with the new WIFCONTINUED()
macro.

Reviewed by: jake


# 97157 23-May-2002 jhb

Whitespace: trim a trailing tab.


# 96122 06-May-2002 alfred

Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.


# 96118 06-May-2002 jhb

When checking to see if the init process calls exit1(), compare p to the
initproc proc pointer instead of checking to see if the pid is 1.

Submitted by: bde


# 96117 06-May-2002 jhb

Style fixes in local variable declarations.

Submitted by: bde


# 96115 06-May-2002 jhb

- Style fixes in some comments.
- Whitespace nit.
- Sort some includes.

Submitted by: bde (mostly)


# 96017 04-May-2002 alfred

style(9): 'if' and 'while' need a space after them.


# 95969 03-May-2002 tanimura

Fix the lock order reversal between the sigio lock and a process/pgrp lock in
funsetownlst() by locking the sigio lock across funsetownlst().


# 95937 02-May-2002 jhb

- Reorder a few things so that when we lock the process at the end of
exit1() we don't have to release it until we acquire schd_lock to
call cpu_throw().
- Since we can switch at any time due to preemption or a lock release
prior to acquiring sched_lock, don't update switchtime and switchticks
until the very end of exit1() after we have acquired sched_lock.
- Interlock the proctree_lock and proc lock in wait1() and exit1() to
avoid lost wakeups when a parent blocks waiting for a child to exit at
the bottom of wait1(). In exit1() the proc lock interlocked with
proctree_lock (and released after acquiring sched_lock) is that of
the parent process.
- In wait1() use an exclusive lock of proctree lock while we are
looking for a process to harvest. This allows us to completely
remove all references to the process once we've found one (i.e.,
disconnect it from pgrp's, session's, zombproc list, and it's parent's
children list) "atomically" without needing to worry about a lock
upgrade.
- We don't need sched_lock to test if p_stat is SZOMB or SSTOP when holding
the proc lock since the proc lock is always held with p_stat is set to
SZOMB or SSTOP.
- Protect nprocs with an xlock of the allproc_lock.


# 95594 27-Apr-2002 iedowse

Avoid the user-visible effect of setting SA_NOCLDWAIT when the
SIGCHLD handler is SIG_IGN. This is a reimplementation of the
problematic revision 1.131 of kern_exit.c. To avoid accessing process
UPAGES, we set a new procsig flag when the SIGCHLD handler is SIG_IGN
and use that instead.


# 94858 16-Apr-2002 jhb

- Lock proctree_lock instead of pgrpsess_lock.
- Exclusively lock proctree_lock while calling leavepgrp().


# 94302 09-Apr-2002 jhb

We don't need Giant to read the pgrp ID since the proc lock has protected
p_pgrp since the pgrp locking went in. We also don't need it to check for
invalid values in the options argument to wait1(), so push Giant down
slightly.


# 93471 31-Mar-2002 alfred

Close some holes with p->p_args by NULL'ing out the p->p_args pointer
while holding the proc lock, and by holding the pargs structure when
accessing it from outside of the owner.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# 93295 27-Mar-2002 alfred

Make the reference counting of 'struct pargs' SMP safe.

There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92723 19-Mar-2002 alfred

Remove __P.


# 92068 11-Mar-2002 tanimura

Do not lock the pgrpsess_lock exclusively across ttywait().

Spotted by: David Wolfskill <david@catwhisker.org>
Investigated by: rwatson


# 91140 23-Feb-2002 tanimura

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 91066 22-Feb-2002 phk

Convert p->p_runtime and PCPU(switchtime) to bintime format.


# 90263 05-Feb-2002 alfred

Fix a race with free'ing vmspaces at process exit when vmspaces are
shared.

Also introduce vm_endcopy instead of using pointer tricks when
initializing new vmspaces.

The race occured because of how the reference was utilized:
test vmspace reference,
possibly block,
decrement reference

When sharing a vmspace between multiple processes it was possible
for two processes exiting at the same time to test the reference
count, possibly block and neither one free because they wouldn't
see the other's update.

Submitted by: green


# 88941 05-Jan-2002 dwmalone

Release text vnode in exit() rather than wait(). Occasionally
fifesystem problems could prevent the release from completing and
this could result in init being blocked indefinitely.

This was looked over by Matt ages ago.

Approved by: dillon


# 88900 05-Jan-2002 jhb

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex releease and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# 88715 30-Dec-2001 alc

Eliminate semexit_hook using at_exit(9) and rm_at_exit(9).

Reviewed by: alfred


# 88633 29-Dec-2001 alfred

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 85864 02-Nov-2001 phk

#ifdef KTRACE a variable to silence a warning.

Submitted by: Maxime "mux" Henrion <mux@qualys.com>


# 85721 30-Oct-2001 julian

Use the thread we have instead of finding another
that may be the wrong one.


# 85525 26-Oct-2001 jhb

Add a per-thread ucred reference for syscalls and synchronous traps from
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.

Tested on: x86 (SMP), alpha, sparc64


# 85397 23-Oct-2001 dillon

Fix ktrace enablement/disablement races that can result in a vnode
ref count panic.

Bug noticed by: ps
Reviewed by: ps
MFC after: 1 day


# 85388 23-Oct-2001 jhb

Change the sx(9) assertion API to use a sx_assert() function similar to
mtx_assert(9) rather than several SX_ASSERT_* macros.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83276 10-Sep-2001 peter

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


# 82714 01-Sep-2001 dillon

Giant pushdown sys_exit(), [o]wait(), wait4()


# 81338 09-Aug-2001 peter

Forced commit; Previous commit was to remove check_sigacts() that was
added with the linuxthreads commit but appears to have never been used.


# 81332 08-Aug-2001 peter

*** empty log message ***


# 80974 01-Aug-2001 peter

Temporarily back out kern_sig.c rev 1.125 and kern_exit.c rev 1.131.
This paniced my one of my machines one time too many :-( and there is
no sign of a solution in the pipeline. The deltas are still easily
available in cvs. The problem is that if the parent has been swapped
out, the child process cannot grope around in the parent's UPAGES to
see the sigact[] array or it will fault. This probably is a showstopper
for this implementation anyway.


# 80156 22-Jul-2001 dillon

As per further discussions on hackers redo the SIGCHLD patch to not generate
an unexpected user-visible side effect with the sigaction flags. Also cleanup
a minor union issue.

Submitted by: Rudolf Cejka <cejkar@dcse.fee.vutbr.cz>
MFC addendum: MFC will be combined w/ original commit
MFC after: 3 days


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 78870 27-Jun-2001 jhb

- Always use the proc lock of the task leader to protect the peers list of
processes.
- Don't construct fake call args and then call kill(). psignal is not
anymore complicated and is quicker and not prone to locking problems.
Calling psignal() avoids having to do a pfind() since we already have a
proc pointer and also allows us to keep the task leader locked while we
kill all the peer processes so the list is kept coherent.
- When a kthread exits, do a wakeup() on its proc pointers. This can be
used by kernel modules that have kthreads and want to ensure they have
safely exited before completely the MOD_UNLOAD event.

Connectivity provided by: Usenix wireless


# 77183 25-May-2001 rwatson

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76271 04-May-2001 jhb

Don't hold the process mutex across calls to FREE() since the vm system
uses lockmgr locks and this leads to a lock order reversal. At this point
in wait1() the process is not on any process lists or in the process tree,
so no other process should be able to find it or have a reference to it
anyways, so the locking is not needed.


# 75950 25-Apr-2001 tanimura

Do not leave a process with no credential in zombproc.

Reviewed by: jhb


# 75893 23-Apr-2001 jhb

Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake


# 74927 28-Mar-2001 jhb

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# 74914 28-Mar-2001 jhb

Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>


# 73908 07-Mar-2001 jhb

- Call proc_reparent() when handing a process off to init in exit rather
than dinking around in the process lists explicitly.
- Hold both the proctree lock and proc lock of the child process when
reparenting a process via proc_reparent.
- Lock processes while sending them signals.
- Miscellaenous proc locking.
- proc_reparent() now asserts that the child is locked in addition to an
exclusive proctree lock.


# 72921 22-Feb-2001 tegge

Streamline updating of switchtime (don't copy code from kern_sync.c).

Submitted by: jhb


# 72919 22-Feb-2001 tegge

Protect update of the per processor switchtime variable against
interrupts.

Protect usage of the per processor switchtime variable against
interrupts in calcru().

This seem to eliminate the "microuptime() went backwards" warnings.


# 72786 21-Feb-2001 rwatson

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 72255 09-Feb-2001 jhb

Revert the previous revision for two reasons:
- I can't seem to reproduce the warning I got from WITNESS anymore.
- The fix was wrong. Since a uidinfo struct is a member of proc, it
makes sense for the locking order to be such that you are allowed to
hold proc and then grab the uidinfo lock.


# 72229 09-Feb-2001 jhb

Release the proc lock around crfree() and uifree() in wait1(). It leads to
a lock order violation, and since p is already a zombie at this point,
I'm not sure that we even need all the locking currently in wait1().


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 71499 23-Jan-2001 jhb

- Proc locking.
- Protect calcru() with sched_lock.


# 70861 10-Jan-2001 jake

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.


# 70317 23-Dec-2000 jake

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# 70143 18-Dec-2000 jake

Whitespace. Fix a comment block and an if statement that were wider
than 80 characters.


# 69947 12-Dec-2000 jake

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 69565 04-Dec-2000 jake

Remove if defined(tahoe) cobwebs.


# 69537 02-Dec-2000 jhb

- Add a mutex to the proc structure p_mtx that will be used to lock accesses
to each individual proc.
- Initialize the lock during fork1(), and destroy it in wait1().


# 69485 01-Dec-2000 jhb

Protect p_stat with sched_lock.


# 69435 01-Dec-2000 jhb

Don't update p_stat in exit1() to SZOMB until after releasing the allproc
lock. Otherwise, if we block on the backing mutex while releasing the
allproc lock, then when we resume, we will be at SRUN, and we will stay
that way all the way through cpu_exit. As a result, our parent will never
harvest us.


# 69286 27-Nov-2000 jake

Use callout_reset instead of timeout(9). Most callouts are statically
allocated, 2 have been added to struct proc for setitimer and sleep.

Reviewed by: jhb, jlemon


# 69022 22-Nov-2000 jake

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 67365 20-Oct-2000 jhb

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 65980 17-Sep-2000 bde

Added used include of <sys/mutex.h> (don't depend on pollution in
<sys/signalvar.h>).


# 65557 06-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 65495 05-Sep-2000 truckman

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 63986 28-Jul-2000 peter

Change the 'exit()' system call to 'sys_exit()'. This avoids overlapping
gcc's internal exit() prototypes and the (futile) hackery that we did to
try and avoid warnings. main() was renamed for similar reasons.
Remove an exit related hack from makesyscalls.sh.


# 61976 22-Jun-2000 alfred

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 60759 21-May-2000 green

Back out NOTE_EXIT status reporting pending discussion.


# 60659 16-May-2000 green

Put the wait(2) exit status in "data" for NOTE_EXIT kevents.


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 55707 10-Jan-2000 sef

Handle the case where we truss an SUGID program -- in particular, we need
to wake up any processes waiting via PIOCWAIT on process exit, and truss
needs to be more aware that a process may actually disappear while it's
waiting.

Reviewed by: Paul Saab <ps@yahoo-inc.com>


# 53823 28-Nov-1999 bde

Scheduler fixes equivalent to the ones logged in the following NetBSD
commit to kern_synch.c:

----------------------------
revision 1.55
date: 1999/02/23 02:56:03; author: ross; state: Exp; lines: +39 -10
Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclk() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclk() mechansism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurance an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock. Define these for alpha, where hz==1024, and nice was
totally broke.

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater wieght) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this, it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calcuation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
----------------------------

The details are a little different in FreeBSD:

=== nice bug === Fixing this is the main point of this commit. We use
essentially the same clipping rule as NetBSD (our limit on p_estcpu
differs by a scale factor). However, clipping at all is fundamentally
bad. It gives free CPU the hoggiest hogs once they reach the limit, and
reaching the limit is normal for long-running hogs. This will be fixed
later.

=== New schedclk() mechanism === We don't use the NetBSD schedclk()
(now schedclock()) mechanism. We require (real)stathz to be about 128
and scale by an extra factor of 2 compared with NetBSD's statclock().
We scale p_estcpu instead of scaling the clock. This is more accurate
and flexible.

=== Algorithm change === Same change.

=== Other bugs === The p_pctcpu bug was fixed long ago. We don't try as
hard to abstract functionality yet.

Related changes: the new limit on p_estcpu must be exported to kern_exit.c
for clipping in wait1().

Agreed with by: dufault


# 53503 21-Nov-1999 phk

s/p_cred->pc_ucred/p_ucred/g


# 53432 19-Nov-1999 phk

The at_exit and at_fork functions currently use a 'roll your own'
linked list to store the callbak routines. The patch converts the
lists to queue(3) TAILQs, making the code slightly clearer and ensuring
that callbacks are executed in FIFO order.

Man page also updated as necesary.

(discontinued use of M_TEMP malloc type while here anyway /phk)

Submitted by: Jake Burkholder jake@checker.org
PR: 14912


# 53239 16-Nov-1999 phk

Introduce commandline caching in the kernel.

This fixes some nasty procfs problems for SMP, makes ps(1) run much faster,
and makes ps(1) even less dependent on /proc which will aid chroot and
jails alike.

To disable this facility and revert to previous behaviour:
sysctl -w kern.ps_arg_cache_limit=0

For full details see the current@FreeBSD.org mail-archives.


# 53212 16-Nov-1999 phk

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 52140 11-Oct-1999 luoqi

Add a per-signal flag to mark handlers registered with osigaction, so we
can provide the correct context to each signal handler.

Fix broken sigsuspend(): don't use p_oldsigmask as a flag, use SAS_OLDMASK
as we did before the linuxthreads support merge (submitted by bde).

Move ps_sigstk from to p_sigacts to the main proc structure since signal
stack should not be shared among threads.

Move SAS_OLDMASK and SAS_ALTSTACK flags from sigacts::ps_flags to proc::p_flag.
Move PS_NOCLDSTOP and PS_NOCLDWAIT flags from proc::p_flag to procsig::ps_flag.

Reviewed by: marcel, jdp, bde


# 52127 11-Oct-1999 peter

Clean up some cruft. We don't run <= 4.3 binaries on hp300 or luna68k
arches using owait(2).


# 51791 29-Sep-1999 marcel

sigset_t change (part 2 of 5)
-----------------------------

The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.

The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.

struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.

signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.

Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.

NOTE: kdump (and port linux_kdump) must be recompiled.

Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50465 27-Aug-1999 marcel

Add sysctl variables for the Linuxulator. These reside under `compat.linux' as
discussed on current.

The following variables are defined (for now):

osname (defaults to "Linux")
Allow users to change the name of the OS as returned by uname(2),
specially added for all those Linux Netscape users and statistics
maniacs :-) We now have what we all wanted!

osrelease (defaults to "2.2.5")
Allow users to change the version of the OS as returned by uname(2).
Since -current supports glibc2.1 now, change the default to 2.2.5
(was 2.0.36).

oss_version (defaults to 198144 [0x030600])
This one will be used by the OSS_GETVERSION ioctl (PR 12917) which I
can commit now that we have the MIB. The default version number is the
lowest version possible with the current 'encoding'.

A note about imprisoned processes (see jail(2)):
These variables are copy-on-write (as suggested by phk). This means that
imprisoned processes will use the system wide value unless it is written/set
by the process. From that moment on, a copy local to the prison will be
used.

A note about the implementation:
I choose to add a single pointer to struct prison, because I didn't like the
idea of changing struct prison every time I come up with a new variable. As
a side effect, the extra storage is only needed when a variable is set from
within the prison. This also minimizes kernel bloat when the Linuxulator is
not used; both compiled in or as a module.

Reviewed by: bde (first version only) and phk


# 47829 07-Jun-1999 msmith

From the submitter:

- this causes POSIX locking to use the thread group leader
(p->p_leader) as the locking thread for all advisory locks.
In non-kernel-threaded code p->p_leader == p, so this will have
no effect.

This results in (more) correct POSIX threaded flock-ing semantics.

It also prevents the leader from exiting before any of the children.
(so that p->p_leader will never be stale) in exit1().

We have been running this patch for over a month now in our lab
under load and at customer sites.

Submitted by: John Plevyak <jplevyak@inktomi.com>


# 46155 28-Apr-1999 phk

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 46129 27-Apr-1999 luoqi

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 45739 17-Apr-1999 peter

Well folks, this is it - The second stage of the removal for build support
for LKM's..


# 44673 11-Mar-1999 bde

Fixed runtime accounting. The time since the previous context switch
was discarded on every call to calcru(). Hacking on the `switchtime'
global for a related fix in rev.1.38 of kern_resource.c was too fragile
and broke when p_switchtime went away.

PR: 10402


# 44384 01-Mar-1999 julian

Fix thread/process tracking and differentiation for Linux threads emulation.

Submitted by: Richard Seaman, Jr." <dick@tar.com>

Also clean some compiler warnings in surrounding code.


# 44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 43448 31-Jan-1999 newton

Added comments about non-staticization so it doesn't get un-done next
time someone goes on a staticization binge.

Suggested by: eivind


# 43408 30-Jan-1999 newton

Unstaticized routines which are needed by the svr4 KLD and the streams
garbage needed to support SysVR4 networking.


# 43208 26-Jan-1999 julian

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 42379 07-Jan-1999 julian

Changes to the LINUX_THREADS support to only allocate extra memory for
shared signal handling when there is shared signal handling being
used.

This removes the main objection to making the shared signal handling
a standard ability in rfork() and friends and 'unconditionalising'
this code. (i.e. the allocation of an extra 328 bytes per process).

Signal handling information remains in the U area until such a time as
it's reference count would be incremented to > 1. At that point a new
struct is malloc'd and maintained in KVM so that it can be shared between
the processes (threads) using it.

A function to check the reference count and move the struct back to the U
area when it drops back to 1 is also supplied. Signal information is
therefore now swapable for all processes that are not sharing that
information with other processes. THis should addres the concerns raised
by Garrett and others.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 41931 19-Dec-1998 julian

Reviewed by: Luoqi Chen, Jordan Hubbard
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)

Code to allow Linux Threads to run under FreeBSD.

By default not enabled
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garret)
This is not yet a 'real' option but will be within some number of hours.


# 41086 11-Nov-1998 truckman

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 41059 10-Nov-1998 peter

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


# 36676 05-Jun-1998 dg

Moved limit frobbing (and the resulting limcopy()) that occurs for
accounting to the accounting function so that this isn't needlessly
done for some process exits.
Reviewed by: bde,phk


# 35058 06-Apr-1998 phk

Make a kernel version of the timer* functions called timerval* to be
more consistent.

OK'ed by: bde


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 31778 16-Dec-1997 eivind

Make COMPAT_43 and COMPAT_SUNOS new-style options.


# 31618 07-Dec-1997 sef

Use at_exit() to invoke procfs_exit() instead of calling it directly.
Note that an unload facility should be used to call rm_at_exit() (if
procfs is being loaded as an LKM and is subsequently removed), but it
was non-obvious how to do this in the VFS framework.

Reviewed by: Julian Elischer


# 31610 07-Dec-1997 sef

Surround the call to procfs_exit() by #ifdef PROCFS/#endif -- much to my
surprise, procfs actually is optional, and some people truly do generate
kernels without it. Wow. I built a kernel without 'options PROCFS' and
it compiled and linked.


# 31564 06-Dec-1997 sef

Changes to allow event-based process monitoring and control.


# 31320 20-Nov-1997 bde

Avoid passing a `retval' to wait1()

Disallow wait options that are not a combination of the standard POSIX
options WUNTRACED and WNOHANG, as is required by POSIX. BSD doesn't
have any extensions here, but the code was `#ifdef notyet' for some
reason.


# 30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 29680 21-Sep-1997 gibbs

init_main.c subr_autoconf.c:
Add support for "interrupt driven configuration hooks".
A component of the kernel can register a hook, most likely
during auto-configuration, and receive a callback once
interrupt services are available. This callback will occur before
the root and dump devices are configured, so the configuration
task can affect the selection of those two devices or complete
any tasks that need to be performed prior to launching init.
System boot is posponed so long as a hook is registered. The
hook owner is responsible for removing the hook once their task
is complete or the system boot can continue.

kern_acct.c kern_clock.c kern_exit.c kern_synch.c kern_time.c:
Change the interface and implementation for the kernel callout
service. The new implemntaion is based on the work of
Adam M. Costello and George Varghese, published in a technical
report entitled "Redesigning the BSD Callout and Timer Facilities".
The interface used in FreeBSD is a little different than the one
outlined in the paper. The new function prototypes are:

struct callout_handle timeout(void (*func)(void *),
void *arg, int ticks);

void untimeout(void (*func)(void *), void *arg,
struct callout_handle handle);

If a client wishes to remove a timeout, it must store the
callout_handle returned by timeout and pass it to untimeout.

The new implementation gives 0(1) insert and removal of callouts
making this interface scale well even for applications that
keep 100s of callouts outstanding.

See the updated timeout.9 man page for more details.


# 29340 13-Sep-1997 joerg

Implement SA_NOCLDWAIT.

The implementation is done (unlike what i've originally been
contemplating) by reparenting kids of processes that have the
appropriate bit set to PID 1, and let PID 1 handle the zombie. This
is far less problematical than what would seem to be ``doing it
right'', for a number of reasons.

Of our currently shipping PID-1-intended programs, 50 % fail the above
assumption. ;-) (Read this: sysinstall doesn't do it right. This is
no problem as long as no program called by sysinstall actually uses
SA_NOCLDWAIT.)

ToDo: . clarify the correct SA_* flag inheritance, compared
to other systems,
. decide whether the compat cruft (osigvec(9)) should
deal with new system additions or not,
. merge OpenBSD's SA_SIGINFO implementation. ;)
Reviewed by: bde


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28767 25-Aug-1997 bde

Fixed some gratuitous ANSIisms.


# 28551 21-Aug-1997 bde

#include <machine/limits.h> explicitly in the few places that it is required.


# 27465 17-Jul-1997 dyson

Clean up some lint associated with the AIO code.


# 27221 06-Jul-1997 dyson

This is an upgrade so that the kernel supports the AIO calls from
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.


# 26671 15-Jun-1997 dyson

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 25999 22-May-1997 phk

Remove cruft relating to p_selbits and p_selbits_size


# 24691 07-Apr-1997 peter

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplity the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measuably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 24203 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21926 21-Jan-1997 davidn

Copy process resource settings before modifying.

Candidate for 2.2.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 18897 12-Oct-1996 dyson

Performance optimizations. One of which was meant to go in before the
previous snap. Specifically, kern_exit and kern_exec now makes a
call into the pmap module to do a very fast removal of pages from the
address space. Additionally, the pmap module now updates the PG_MAPPED
and PG_WRITABLE flags. This is an optional optimization, but helpful
on the X86.


# 18694 04-Oct-1996 julian

If we have no console device it is possible to be
1/ session leader
2/ Have a console device vnode (/dev/console)
3/ have NULL pointer for a consoel tty struct.

fix the only case where the tty struct is referenced without a prior
check for existance.


# 18277 13-Sep-1996 bde

Don't use __dead in the kernel. It was an obfuscation for gcc >= 2.5
and a no-op for gcc >= 2.6.


# 17768 22-Aug-1996 julian

Some cleanups to the callout lists recently added.
note that at_shutdown has a new parameter to indicate When
during a shutdown the callout should be made. also
add a RB_POWEROFF flag to reboot "howto" parameter..
tells the reboot code in our at_shutdown module to turn off the UPS
and kill the power. bound to be useful eventually on laptops


# 17702 20-Aug-1996 smpatel

Remove the kernel FD_SETSIZE limit for select().
Make select()'s first argument 'int' not 'u_int'.

Reviewed by: bde


# 17660 19-Aug-1996 julian

add callout lists for exit() and fork()

I've been meaning to do this for AGES as I keep having to patch those routines
whenever I write a proprietary package or similar..

any module that assigns resources to processes needs to know when
these events occur. there are existsing modules that should be modified
to take advantage of these.. e.g. SYSV IPC primatives
presently have #ifdef entries in exit()


this also helps with making LKMs out of such things..

(see the man pages at_exit(9) and at_fork(9))


# 17334 30-Jul-1996 dyson

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 17294 27-Jul-1996 dyson

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 16322 12-Jun-1996 gpalmer

Clean up -Wunused warnings.

Reviewed by: bde


# 15201 11-Apr-1996 bde

Spell cpu_switch() with an i in a comment.


# 15117 07-Apr-1996 bde

Removed never-used #includes of <machine/cpu.h>. Many were apparently
copied from bad examples.


# 14529 11-Mar-1996 hsu

From Lite2: proc LIST changes.
Reviewed by: david & bde


# 14510 11-Mar-1996 hsu

From NetBSD: add #include <sys/acct.h> for acct_process() prototype.
Reviewed by: davidg & bde


# 13523 20-Jan-1996 bde

Removed stale #includes of "opt_sysvipc.h".


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13332 08-Jan-1996 peter

(gulp!) reran makesyscalls..

sysv_ipc.c: add stub functions that either simply return (for the hooks
in kern_fork/kern_exit) or log() a messgae and call enosys() (for the
syscalls). sysv_ipc.c will become "standard" in conf/files and has
#ifs for all the permutations.


# 13226 04-Jan-1996 wollman

Convert SYSV IPC to new-style options. (I hope I got everything...)
The LKMs will need an extra file, to come later.


# 13203 03-Jan-1996 wollman

Converted two options over to the new scheme: USER_LDT and KTRACE.


# 13150 01-Jan-1996 peter

Only #include <sys/shm.h> if SYSVSHM (for shmexit() prototype)
Add missing #include <sys/sem.h> if SYSVSEM (for semexit() prototype)


# 13060 27-Dec-1995 joerg

Call semexit() from exit(), in order to process `undo vectors'.
This function has actually never been called.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12205 11-Nov-1995 bde

Removed unreachable code.

Changed `#if defined()' back to `#ifdef' to finish removing COMPAT_IBCS2.


# 11733 23-Oct-1995 swallace

Remove COMPAT_IBCS2 option.
Ibcs2 emulator no longer depends on owait() or any other hack to wait4().


# 11724 23-Oct-1995 bde

Simplify the pseudo-argument removal changes by not optimizing for
the !COMPAT_43 case - use a common function even when there is no
`old' function. The diffs for this are large because of code motion
to restore the function order to what it was before the pseudo-argument
changes.

Include <sys/sysproto.h> to get correct args structs and prototypes.
The diffs for this are large because the declarations of the args structs
were moved to become comments in the function headers. The comments may
actually match the automatically generated declarations right now.

Add prototypes.


# 11332 07-Oct-1995 swallace

Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.


# 11328 07-Oct-1995 swallace

Remove compat_43 psuedo-argument hack, and replace with a better hack.
Instead of using a fake "compat" argument, pass a real compat int to function
if COMPAT_43 is defined. Functions involved: wait4, accept, recvfrom,
getsockname.

With the compat psuedo-argument, this introduces an argument structure
that can have two possible sizes depending on compat options.
This makes life difficult for lkm modules like ibcs2, which would
have to guess what size used in kernel when compiled. Also,
the prototype generator for these structures cannot generate proper sizes.

Now there is only one fixed structure and makes everybody happy.

I recommend these changes be introduced to 2.1 so that ibcs2, linux
lkm's generated for 2.2 can still run on a 2.1 kernel.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 5260 28-Dec-1994 dg

Fixed multiple bugs that cause null pointers to be followed or FREEed data
to be accessed if a process blocks when it is being run down.


# 3921 27-Oct-1994 phk

Fix the panic message if init dies to show the exit status.


# 3512 11-Oct-1994 sos

Fixed bug in ibcs2 signal translation.


# 3473 09-Oct-1994 sos

Changed option IBCS2 to COMPAT_IBCS2 (for lkm support)


# 3451 09-Oct-1994 dg

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


# 3308 02-Oct-1994 phk

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3098 25-Sep-1994 phk

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 2687 12-Sep-1994 dg

Limit p_estcpu to UCHAR_MAX to keep it within reasonable bounds - else
it goes crazy (into the billions) during any lengthy build.

Submitted by: John Dyson, modified slightly by me.


# 2257 24-Aug-1994 sos

Changes preparing for iBCS support
Reviewed by:
Submitted by:


# 1884 06-Aug-1994 dg

Process scheduling changes - adapted from FreeBSD 1.1.5. Basically,
charge scheduling CPU of child process to the parent and have child
inherit scheduling CPU from parent on fork. Makes a **big** difference
in the feel of the system to interactive users.

Submitted by: John Dyson


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources