#
267654 |
|
19-Jun-2014 |
gjb |
Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
#
251874 |
|
17-Jun-2013 |
scottl |
Big MFC of the physbio changes necessary for unmapped I/O. These changes have been in production at Netflix for several months with significant success.
MFC r246713:
Reform the busdma API so that new types may be added without modifying every architecture's busdma_machdep.c. It is done by unifying the bus_dmamap_load_buffer() routines so that they may be called from MI code. The MD busdma is then given a chance to do any final processing in the complete() callback.
MFC r249538: Some compilers issue a warning when a wider integer is cast to a narrower pointer. Quiet the warning by casting through uintptr_t.
MFC r251479: Simplify the checking of flags for cam_periph_mapmem(). This gets rid of a lot of code redundancy and grossness at very minor expense.
MFC r251837: MFC r251842: Add infrastructure for doing compatibility shims, as has been sorely needed for the last 10 years. Far too much of the internal API is exposed, and every small adjustment causes applications to stop working. To kick this off, bump the API version to 0x17 as should have been done with r246713, but add shims to compensate. Thanks to the shims, there should be no visible change in application behavior.
Submitted by: kib, jeffr Approved by: kib Obtained from: Netflix
|
#
240238 |
|
08-Sep-2012 |
kib |
MFC r239065: Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly in the kernel code which needs it.
|
#
233647 |
|
29-Mar-2012 |
alc |
MFC r233291: Handle spurious page faults that may occur in no-fault sections of the kernel.
|
#
233353 |
|
23-Mar-2012 |
kib |
MFC r231949: Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows SSIZE_MAX-sized i/o requests from usermode.
MFC r232493: Remove an unneeded cast to u_int. The values are small enough to fit into int, and the use of the MIN macro performs the type promotions anyway.
MFC r232494: Instead of incompletely handling read(2)/write(2) return values that do not fit into registers, declare that we do not support this case using CTASSERT(), and remove the endianness-unsafe code that split the return value into td_retval.
While there, change the style of the sysctl debug.iosize_max_clamp definition.
MFC r232495: pipe_read(): change the type of size to int, and remove signed clamp. pipe_write(): change the type of desiredsize back to int, its value fits.
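A minimal sketch of the clamp described above; the variable and the IOSIZE_MAX macro follow later FreeBSD usage and are shown here as assumptions, not the exact committed code:

    #include <sys/param.h>
    #include <sys/limits.h>
    #include <sys/sysctl.h>

    /* Assumed names, modeled on later FreeBSD code. */
    static int iosize_max_clamp = 1;
    SYSCTL_INT(_debug, OID_AUTO, iosize_max_clamp, CTLFLAG_RW,
        &iosize_max_clamp, 0, "Clamp max i/o size to INT_MAX");

    /* Effective limit: INT_MAX when clamped (the default), SSIZE_MAX otherwise. */
    #define IOSIZE_MAX (iosize_max_clamp ? (ssize_t)INT_MAX : SSIZE_MAX)

A read/write path can then reject oversized requests up front with a check such as if (nbyte > IOSIZE_MAX) return (EINVAL);.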
|
#
225736 |
|
22-Sep-2011 |
kensmith |
Copy head to stable/9 as part of 9.0-RELEASE release cycle.
Approved by: re (implicit)
|
#
223889 |
|
09-Jul-2011 |
kib |
Add a facility to disable processing page faults. When activated, uiomove generates EFAULT if any accessed address is not mapped, as opposed to handling the fault.
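The facility is used by bracketing the copy with a disable/enable pair; a hedged sketch of the intended usage (function names as introduced by this change):

    int error, save;

    save = vm_fault_disable_pagefaults();
    /* uiomove() now returns EFAULT for an unmapped address instead of
     * sleeping to handle the fault. */
    error = uiomove(kaddr, len, uio);
    vm_fault_enable_pagefaults(save);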
Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version)
|
#
220114 |
|
28-Mar-2011 |
kib |
Fix the check for vm_map_remove() error.
Pointed out by: alc MFC after: 2 weeks
|
#
220102 |
|
28-Mar-2011 |
kib |
Trim white spaces, adjust style.
MFC after: 2 weeks
|
#
220101 |
|
28-Mar-2011 |
kib |
Handle zero length in copyout_unmap().
Submitted by: John Wehle <john feith com> MFC after: 2 weeks
|
#
220100 |
|
28-Mar-2011 |
kib |
Promote ksyms_map() and ksyms_unmap() to the general-facility interfaces copyout_map() and copyout_unmap().
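Hedged prototypes for the promoted interfaces, as they appear in sys/uio.h (the exact types are an assumption):

    /* Map sz bytes of anonymous memory into td's process and return the
     * chosen base address via *addr. */
    int copyout_map(struct thread *td, vm_offset_t *addr, size_t sz);

    /* Remove such a mapping again; see r220101 above for the zero-length
     * case. */
    int copyout_unmap(struct thread *td, vm_offset_t addr, size_t sz);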
Submitted by: John Wehle <john feith com>, nox MFC after: 2 weeks
|
#
218425 |
|
07-Feb-2011 |
mdf |
Remove the uio_yield prototype and symbol. This function has been misnamed since it was introduced and should not be globally exposed with this name. The equivalent functionality is now available using kern_yield(curthread->td_user_pri). The function remains undocumented.
Bump __FreeBSD_version.
|
#
218424 |
|
07-Feb-2011 |
mdf |
Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yield() as unnecessary, such as in a sysctl handler.
- move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h
- add a slightly more generic kern_yield() that can replace the functionality of uio_yield() (see the sketch after this list).
- replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching.
- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.
- instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
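A hedged sketch of the resulting pattern for a long-running kernel loop; the work-item names are hypothetical:

    /* Hypothetical loop that may hog the CPU without this check. */
    while (!work_done()) {
        process_one_chunk();
        /* Roughly: if (should_yield()) kern_yield(...); */
        maybe_yield();
    }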
|
#
218195 |
|
02-Feb-2011 |
mdf |
Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use should_yield() instead.
MFC after: 1 week
|
#
207796 |
|
08-May-2010 |
alc |
Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mappings(), pmap_remove_all(), and pmap_remove_write().
Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.)
Switch to a per-processor counter for the total number of pages cached.
|
#
207669 |
|
05-May-2010 |
alc |
Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.)
This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename().
Discussed with: kib
|
#
204176 |
|
21-Feb-2010 |
ed |
Decompose the most poorly named file in sys/kern: kern_subr.c.
Although this file has historically been used as a dumping ground for random functions, nowadays it only contains functions related to copying bits {from,to} userspace and hash table utility functions.
Behold, subr_uio.c and subr_hash.c.
|
#
196454 |
|
23-Aug-2009 |
rpaulo |
Constify prime numbers.
|
#
182399 |
|
28-Aug-2008 |
ed |
Make ureadc() warn when holding any locks, just like uiomove().
A couple of months ago I was quite surprised to discover, while writing code, that uiomove() would not allow any locks to be held while ureadc() did, even though ureadc() is implemented using the same building blocks as uiomove().
Let's see if this triggers any additional witness warnings on our source tree.
Reviewed by: attilio
|
#
178272 |
|
17-Apr-2008 |
jeff |
- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes.
- In reset, walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption, ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc.
Sponsored by: Nokia
|
#
170307 |
|
04-Jun-2007 |
jeff |
Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling synchronization.
- Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization.
Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
|
#
166042 |
|
16-Jan-2007 |
rrs |
Remove a useless KASSERT that tested with (flags | ); the KASSERT using ^ is the one that actually does what we want.
Submitted by: Li Xin delphij@delphij.net Reviewed by: rrs Approved by: gnn
|
#
166036 |
|
15-Jan-2007 |
kmacy |
Fix warning by adding extra parentheses
|
#
166022 |
|
15-Jan-2007 |
rrs |
Add a new function hashinit_flags() which allows not waiting for memory (or waiting). The old hashinit() function now calls hashinit_flags(..., HASH_WAITOK).
Reviewed by: rwatson Approved by: gnn
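A hedged usage sketch; M_MYMODULE is a hypothetical malloc type:

    u_long hashmask;
    void *hashtbl;

    /* HASH_NOWAIT: do not sleep for memory; NULL is returned on failure. */
    hashtbl = hashinit_flags(128, M_MYMODULE, &hashmask, HASH_NOWAIT);
    if (hashtbl == NULL)
        return (ENOMEM);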
|
#
164936 |
|
06-Dec-2006 |
julian |
Threading cleanup, part 2 of several.
Make part of John Birrell's KSE patch permanent. Specifically, remove: any reference to the ksegrp structure (this feature was never fully utilised and made things overly complicated), and all code in the scheduler that tried to make threaded programs fair to unthreaded programs (libpthread processes will already do this to some extent and libthr processes already disable it).
Also: Since this makes such a big change to the scheduler(s), take the opportunity to rename some structures and elements that had to be moved anyhow. This makes the code a lot more readable.
The ULE scheduler compiles again but I have no idea if it works.
The 4bsd scheduler still requires a little cleaning, and some functions that now do ALMOST nothing will go away, but I thought I'd do that as a separate commit.
Tested by: David Xu and Dan Eischen, using libthr and libpthread.
|
#
163709 |
|
26-Oct-2006 |
jb |
Make KSE a kernel option, turned on by default in all GENERIC kernel configs except sun4v (which doesn't process signals properly with KSE).
Reviewed by: davidxu@
|
#
161252 |
|
12-Aug-2006 |
alc |
Reduce the scope of the page queues lock in vm_pgmoveco() now that vm_page_sleep_if_busy() no longer requires the page queue lock to be held.
Correctly spell "TRUE".
|
#
139804 |
|
06-Jan-2005 |
imp |
/* -> /*- for copyright notices, minor format tweaks as necessary
|
#
138781 |
|
13-Dec-2004 |
alc |
Correct the handling of two unusual cases by the zero-copy receive path, specifically, vm_pgmoveco():
1. If vm_pgmoveco() sleeps on a busy page, it must redo the lookup because the page may have been freed.
2. If the receive buffer is copy-on-write due to, for example, a fork, then although the first vm object in the shadow chain may not contain a page, there may still be one from a backing object that is mapped. Thus, a pmap_remove() is required for the new page, rather than the backing object's page, to be seen by the application.
Also, add some comments to vm_pgmoveco() and update some assertions.
Tested by: ken@
|
#
138539 |
|
08-Dec-2004 |
alc |
Tidy up the zero-copy receive path: Remove an unneeded argument to uiomoveco() and userspaceco().
|
#
138424 |
|
05-Dec-2004 |
alc |
Update the Tigon 1 and 2 driver to use the sf_buf API for implementing zero-copy receive of jumbo frames. This eliminates the need for the jumbo frame allocator implemented in kern/uipc_jumbo.c and sys/jumbo.h. Remove it.
Note: Zero-copy receive of jumbo frames did not work without these changes; I believe there was insufficient locking on the jumbo vm object.
Tested by: ken@ Discussed with: gallatin@
|
#
137377 |
|
08-Nov-2004 |
alc |
Eliminate an unused argument to vm_pgmoveco().
|
#
137244 |
|
05-Nov-2004 |
alc |
Two changes to vm_pgmoveco():
- Eliminate an initialized but unused variable.
- Eliminate an unnecessary call to clear the page's PG_BUSY flag. (The call to vm_page_rename() already clears the page's PG_BUSY flag through its call to vm_page_remove().)
|
#
137168 |
|
03-Nov-2004 |
alc |
The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol.
This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
|
#
136444 |
|
12-Oct-2004 |
jhb |
Add a WITNESS_WARN() to uiomove() to whine if locks are held when this function is called.
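A sketch of the check added; the flags are the usual witness(4) ones, and the exact message string is an assumption:

    /* Whine if any sleepable locks (other than Giant) are held here. */
    WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "uiomove()");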
MFC after: 1 month
|
#
131897 |
|
10-Jul-2004 |
phk |
Clean up and wash struct iovec and struct uio handling.
Add copyiniov() which copies a struct iovec array in from userland into a malloc'ed struct iovec. Caller frees.
Change uiofromiov() to malloc the uio (caller frees) and name it copyinuio() which is more appropriate.
Add cloneuio() which returns a malloc'ed copy. Caller frees.
Use them throughout.
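Hedged prototypes for the three helpers as they later appear in sys/uio.h (the exact signatures are an assumption); in every case the caller frees the returned memory:

    /* Copy a user iovec array into a malloc'ed kernel array; caller frees. */
    int copyiniov(const struct iovec *iovp, u_int iovcnt,
        struct iovec **iov, int error);

    /* Copy in an iovec array and build a malloc'ed struct uio; caller frees. */
    int copyinuio(const struct iovec *iovp, u_int iovcnt, struct uio **uiop);

    /* Return a malloc'ed copy of an existing uio; caller frees. */
    struct uio *cloneuio(struct uio *uiop);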
|
#
131473 |
|
02-Jul-2004 |
jhb |
- Change mi_switch() and sched_switch() to accept an optional thread to switch to. If a non-NULL thread pointer is passed in, then the CPU will switch to that thread directly rather than calling choosethread() to pick a thread to switch to.
- Make sched_switch() aware of idle threads and know to do TD_SET_CAN_RUN() instead of sticking them on the run queue, rather than requiring all callers of mi_switch() to know to do this if they can be called from an idle thread.
- Move constants for arguments to mi_switch() and thread_single() out of the middle of the function prototypes and up above into their own section.
|
#
130028 |
|
03-Jun-2004 |
tjr |
Remove checks for curthread == NULL - it can't happen.
|
#
130023 |
|
02-Jun-2004 |
tjr |
Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid having to acquire sched_lock when manipulating it in lockmgr(), uiomove(), and uiomove_fromphys().
Reviewed by: jhb
|
#
127911 |
|
05-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999.
Approved by: core
|
#
125423 |
|
04-Feb-2004 |
silby |
Rename iov_to_uio to uiofromiov to be more consistent with other uio* functions.
Suggested by: bde
|
#
125420 |
|
04-Feb-2004 |
silby |
Style fixes
Submitted by: bde
|
#
125316 |
|
02-Feb-2004 |
silby |
Remove debugging code that slipped into the previous commit.
Spotted by: bde
|
#
125296 |
|
01-Feb-2004 |
silby |
Rewrite sendfile's header support so that headers are now sent in the first packet along with data, instead of in their own packet. When serving files of size (packetsize - headersize) or smaller, this will result in one less packet crossing the network. Quick testing with thttpd and http_load has shown a noticeable performance improvement in this case (350 vs 330 fetches per second.)
Included in this commit are two support routines, iov_to_uio and m_uiotombuf; these routines are used by sendfile to construct the header mbuf chain that will be linked to the rest of the data in the socket buffer.
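A hedged sketch of building the header chain; the five-argument prototype shown is the one from later FreeBSD versions and may differ from the original 2004 routine:

    struct mbuf *m;

    /* Copy the header iovecs, wrapped in hdr_uio, into an mbuf chain. */
    m = m_uiotombuf(hdr_uio, M_NOWAIT, 0, 0, 0);
    if (m == NULL)
        return (ENOBUFS);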
|
#
124944 |
|
25-Jan-2004 |
jeff |
- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or SW_INVOL. Assert that one of these is set in mi_switch() and properly adjust the rusage statistics. This simplifies the large number of users of this interface, which were previously all required to adjust the proper counter prior to calling mi_switch(). This also facilitates more switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and remove direct references to the process statistics.
|
#
123852 |
|
26-Dec-2003 |
alfred |
Add __restrict qualifiers to copyinfrom, copyinstrfrom, copystr, copyinstr, copyin and copyout.
|
#
120665 |
|
02-Oct-2003 |
nectar |
Introduce a uiomove_frombuf helper routine that handles computing and validating the offset within a given memory buffer before handing the real work off to uiomove(9).
Use uiomove_frombuf in procfs to correct several issues with integer arithmetic that could result in underflows/overflows. As a side-effect, the code is significantly simplified.
Add additional sanity checks when computing a memory allocation size in pfs_read.
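A hedged sketch of the helper in a procfs-style handler; pfs_foo_fill is hypothetical:

    static int
    pfs_foo_fill(struct uio *uio)
    {
        char buf[64];
        int len;

        len = snprintf(buf, sizeof(buf), "%d\n", 42);
        /* Validates uio_offset against len, then hands off to uiomove(9). */
        return (uiomove_frombuf(buf, len, uio));
    }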
Submitted by: rwatson (original uiomove_frombuf -- bugs are mine :-) Reported by: Joost Pol <joost@pine.nl> (integer underflows/overflows)
|
#
116182 |
|
10-Jun-2003 |
obrien |
Use __FBSDID().
|
#
116110 |
|
09-Jun-2003 |
alc |
- Add vm object locking to vm_pgmoveco().
- Add a comment to vm_pgmoveco() describing what remains to be done for vm locking.
|
#
114758 |
|
05-May-2003 |
jhb |
Tweak the clearing of TDF_DEADLKTREAT so that we only bother grabbing the lock and clearing the flag if it was clear when uiomove() was called.
|
#
112631 |
|
25-Mar-2003 |
jhb |
Remove an extraneous check. We are not going to return from copyin/out on the stack of thread A but actually be thread B instead of thread A.
|
#
111977 |
|
08-Mar-2003 |
ken |
Zero copy send and receive fixes:
- On receive, vm_map_lookup() needs to trigger the creation of a shadow object. To make that happen, call vm_map_lookup() with PROT_WRITE instead of PROT_READ in vm_pgmoveco().
- On send, a shadow object will be created by the vm_map_lookup() in vm_fault(), but vm_page_cowfault() will delete the original page from the backing object rather than simply letting the legacy COW mechanism take over. In other words, the new page should be added to the shadow object rather than replacing the old page in the backing object. (i.e. vm_page_cowfault() should not be called in this case.) We accomplish this by making sure fs.object == fs.first_object before calling vm_page_cowfault() in vm_fault().
Submitted by: gallatin, alc Tested by: ken
|
#
111937 |
|
06-Mar-2003 |
alc |
Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.
Discussed on: arch@
|
#
111739 |
|
02-Mar-2003 |
des |
Convert one of our main caddr_t consumers, uiomove(9), to void *.
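The conversion in prototype form, per the commit:

    /* before */ int uiomove(caddr_t cp, int n, struct uio *uio);
    /* after  */ int uiomove(void *cp, int n, struct uio *uio);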
|
#
111737 |
|
02-Mar-2003 |
des |
Clean up whitespace, unregisterize, ANSIfy, remove prototypes made superfluous by ANSIfication.
|
#
111119 |
|
19-Feb-2003 |
imp |
Back out M_* changes, per decision of the TRB.
Approved by: trb
|
#
109623 |
|
21-Jan-2003 |
alfred |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
#
108384 |
|
29-Dec-2002 |
alc |
Reduce the number of times that we acquire and release the page queues lock by making vm_page_rename()'s caller, rather than vm_page_rename(), responsible for acquiring it.
|
#
108139 |
|
20-Dec-2002 |
alc |
Extend the scope of the page queues lock in vm_pgmoveco().
|
#
108053 |
|
18-Dec-2002 |
alc |
Hold the page queues lock when performing vm_page_busy().
|
#
107371 |
|
28-Nov-2002 |
alc |
Use pmap_remove_all() instead of pmap_remove() before freeing the page in vm_pgmoveco(); the page may have more than one mapping. Hold the page queues lock when calling pmap_remove_all().
Approved by: re (blanket)
|
#
104964 |
|
12-Oct-2002 |
jeff |
- Create a new scheduler api that is defined in sys/sched.h.
- Begin moving scheduler specific functionality into sched_4bsd.c.
- Replace direct manipulation of scheduler data with hooks provided by the new api.
- Remove KSE specific state modifications and single runq assumptions from kern_switch.c.
Reviewed by: -arch
|
#
104908 |
|
11-Oct-2002 |
mike |
Change iov_base's type from `char *' to the standard `void *'. All uses of iov_base which assume its type is `char *' (in order to do pointer arithmetic) have been updated to cast iov_base to `char *'.
|
#
101303 |
|
04-Aug-2002 |
alc |
o Convert a vm_page_sleep_busy() into a vm_page_sleep_if_busy() with appropriate page queue locking.
|
#
100448 |
|
21-Jul-2002 |
alc |
o Lock page queue accesses by vm_page_free().
|
#
99848 |
|
12-Jul-2002 |
ken |
Fix compilation with ENABLE_VFS_IOOPT turned on and ZERO_COPY_SOCKETS turned off.
Clean up #ifdefs, and remove a bunch of unnecessary includes.
Reviewed by: bde Tested by: netchild
|
#
99098 |
|
30-Jun-2002 |
iedowse |
Add a hashdestroy() function to undo the actions of hashinit().
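A hedged sketch pairing the two calls; M_TEMP stands in for the caller's malloc type:

    u_long hashmask;
    void *hashtbl;

    hashtbl = hashinit(128, M_TEMP, &hashmask);
    /* ... use the table; empty it again before teardown ... */
    hashdestroy(hashtbl, M_TEMP, hashmask);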
|
#
99072 |
|
29-Jun-2002 |
julian |
Part 1 of KSE-III
The ability to schedule multiple threads per process (on one cpu) by making ALL system calls optionally asynchronous. To come: ia64 and power-pc patches, patches for gdb, test program (in tools).
Reviewed by: Almost everyone who counts (at various times, peter, jhb, matt, alfred, mini, bernd, and a cast of thousands)
NOTE: this is still Beta code, and contains lots of debugging stuff. Expect slight instability in signals.
|
#
98998 |
|
28-Jun-2002 |
alfred |
More caddr_t removal. Change struct knote's kn_hook from caddr_t to void *.
|
#
98849 |
|
26-Jun-2002 |
ken |
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links.
jumbo.9: New man page describing the jumbo buffer allocator interface and operation.
zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions.
uipc_syscalls.c: Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing parameters to more useful defaults.
Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable.
ufs_readwrite.c: Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object_allocate() does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structure.
This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault routines.
vm_page.h: Add page based COW function prototypes and variable in the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
|
#
98477 |
|
20-Jun-2002 |
peter |
Remove UIO_USERISPACE - we do not support any split instruction/data address space machines (eg: pdp-11) and are not likely to ever do so. Nothing in our kernel sets this.
|
#
96080 |
|
05-May-2002 |
alc |
o Condition the compilation of uiomoveco() and vm_uiomove() on ENABLE_VFS_IOOPT.
o Add a comment to the effect that this code is experimental support for zero-copy I/O.
|
#
90538 |
|
11-Feb-2002 |
julian |
In a threaded world, different priorities become properties of different entities. Make it so.
Reviewed by: jhb@freebsd.org (john baldwin)
|
#
90413 |
|
08-Feb-2002 |
tmm |
Fix a bug introduced in r. 1.28: when copy{in,out} would fail for an iovec that was not the last one in the uio, the error would be ignored silently.
Bug found and fix proposed by: jhb
|
#
88900 |
|
05-Jan-2002 |
jhb |
Change the preemption code for software interrupt thread schedules and mutex releases to not require flags for the cases when preemption is not allowed:
The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent switching to a higher priority thread on mutex release and swi schedule, respectively, when that switch is not safe. Now that the critical section API maintains a per-thread nesting count, the kernel can easily check whether or not it should switch without relying on flags from the programmer. This fixes a few bugs in that all current callers of swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from fast interrupt handlers and the swi_sched of softclock needed this flag. Note that to ensure that swi_sched()'s in clock and fast interrupt handlers do not switch, these handlers have to be explicitly wrapped in critical_enter/exit pairs. Presently, just wrapping the handlers is sufficient, but in the future with the fully preemptive kernel, the interrupt must be EOI'd before critical_exit() is called. (critical_exit() can switch due to a deferred preemption in a fully preemptive kernel.)
I've tested the changes to the interrupt code on i386 and alpha. I have not tested ia64, but the interrupt code is almost identical to the alpha code, so I expect it will work fine. PowerPC and ARM do not yet have interrupt code in the tree so they shouldn't be broken. Sparc64 is broken, but that's been ok'd by jake and tmm who will be fixing the interrupt code for sparc64 shortly.
Reviewed by: peter Tested on: i386, alpha
|
#
83959 |
|
26-Sep-2001 |
dillon |
Make uio_yield() a global. Call uio_yield() between chunks in vn_rdwr_inchunks(), allowing other processes to gain an exclusive lock on the vnode. Specifically: directory scanning, to avoid a race to the root directory, and multiple child processes coring simultaneously, so they can figure out that some other coring child has an exclusive adv lock and just exit instead.
This completely fixes performance problems when large programs core. You can have hundreds of copies (forked children) of the same binary core all at once and not notice.
MFC after: 3 days
|
#
83420 |
|
13-Sep-2001 |
jhb |
Fix locking on td_flags for TDF_DEADLKTREAT. If the comments in the code are true that curthread can change during this function, then this flag needs to become a KSE flag, not a thread flag.
|
#
83366 |
|
12-Sep-2001 |
julian |
KSE Milestone 2. Note: ALL MODULES MUST BE RECOMPILED. Make the kernel aware that there are smaller units of scheduling than the process (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process.
Sorry, John! (Your next MFC will be a doozy!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
|
#
79145 |
|
03-Jul-2001 |
jhb |
Remove spl's in uio_yield() that are covered by the sched_lock.
|
#
78431 |
|
18-Jun-2001 |
wollman |
After one too many PRs on the subject, bite the bullet and define IOV_MAX and its associated constants. Implement _SC_IOV_MAX in the usual way. Be a bit sloppy about the namespace question; this should get cleared up in time for 5.0.
MFC after: 1 month
|
#
76166 |
|
01-May-2001 |
markm |
Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h from kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
|
#
72537 |
|
16-Feb-2001 |
jlemon |
Introduce copyinfrom and copyinstrfrom, which can copy data from either user or kernel space. This will allow layering of os-compat (e.g.: linux) system calls. Apply the changes to mount.
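Hedged prototypes (the __restrict qualifiers were added later, in r123852 above); the seg argument selects the source address space:

    /* seg is UIO_USERSPACE or UIO_SYSSPACE. */
    int copyinfrom(const void *src, void *dst, size_t len, int seg);
    int copyinstrfrom(const void *src, void *dst, size_t len,
        size_t *copied, int seg);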
|
#
72376 |
|
11-Feb-2001 |
jake |
Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propagated up into interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure.
- Group the priority level, user priority, native priority (before propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt.
- Activate propagate_priority(). It should now have the desired effect without needing to also propagate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propagate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propagate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at appropriate times by the vm system.
- Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
|
#
72200 |
|
09-Feb-2001 |
bmilekic |
Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
Similarly, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case.
Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those.
Finally, caught up to the interface changes in all sys code.
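The conversion at a call site, per the commit:

    /* old */ mtx_enter(&foo_mtx, MTX_DEF);     mtx_exit(&foo_mtx, MTX_DEF);
    /* new */ mtx_lock(&foo_mtx);               mtx_unlock(&foo_mtx);

    /* old */ mtx_enter(&sched_lock, MTX_SPIN); mtx_exit(&sched_lock, MTX_SPIN);
    /* new */ mtx_lock_spin(&sched_lock);       mtx_unlock_spin(&sched_lock);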
Contributors: jake, jhb, jasone (in no particular order)
|
#
70861 |
|
10-Jan-2001 |
jake |
Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables other than curproc.
|
#
68862 |
|
17-Nov-2000 |
jake |
- Split the run queue and sleep queue linkage, so that a process may block on a mutex while on the sleep queue without corrupting it.
- Move dropping of Giant to after the acquire of sched_lock.
Tested by: John Hay <jhay@icomtek.csir.co.za> jhb
|
#
68808 |
|
16-Nov-2000 |
jhb |
Don't release and acquire Giant in mi_switch(). Instead, release and acquire Giant as needed in functions that call mi_switch(). The releases need to be done outside of the sched_lock to avoid potential deadlocks from trying to acquire Giant while interrupts are disabled.
Submitted by: witness
|
#
67365 |
|
20-Oct-2000 |
jhb |
Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h
|
#
65841 |
|
14-Sep-2000 |
eivind |
GC vax-only code
|
#
65557 |
|
06-Sep-2000 |
jasone |
Major update to the way synchronization is done in the kernel. Highlights include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be preempted (i386 only).
Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
|
#
60938 |
|
26-May-2000 |
jake |
Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen.
Requested by: msmith and others
|
#
60833 |
|
23-May-2000 |
jake |
Change the way that the queue(3) structures are declared; don't assume that the type argument to *_HEAD and *_ENTRY is a struct.
Suggested by: phk Reviewed by: phk Approved by: mdodd
|
#
52635 |
|
29-Oct-1999 |
phk |
useracc() the prequel:
Merge the contents (less some trivial, bordering-on-silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs.
This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
|
#
50477 |
|
27-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
#
48677 |
|
08-Jul-1999 |
mckusick |
These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bringing the machine to a grinding halt.
* buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems.
* minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propagate to deeper inlines and should produce better code.
* removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely.
* removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal.
* buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c
* VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system.
* removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers.
* slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers.
* addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation.
* Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ).
* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ).
* removal of extra arguments passed to getnewbuf() that are not necessary.
* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c
* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently.
* removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed at some point soon.
Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
#
44681 |
|
12-Mar-1999 |
julian |
This solves a deadlock that can occur when read()ing into a file-mmap() space. When doing this, it is possible for another process to attempt to get an exclusive lock on the vnode and deadlock the mmap/read combination when the uiomove() call tries to obtain a second shared lock on the vnode. There is still a potential deadlock situation with write()/mmap(). Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Luoqi Chen <luoqi@freebsd.org> Delimited by tags PRE_MATT_MMAP_LOCK and POST_MATT_MMAP_LOCK in kern/kern_lock.c and kern/kern_subr.c
|
#
44220 |
|
22-Feb-1999 |
bde |
The previous commit also fixed a possibly-wrong (too high) priority for the rescheduled process.
|
#
44218 |
|
22-Feb-1999 |
bde |
Improved scheduling in uiomove(), etc. resched_wanted() is true too often for it to be a good criterion for switching kernel cpu hogs -- it is true after most wakeups. Use the criterion "has been running for >= 2 quanta" instead.
|
#
43529 |
|
02-Feb-1999 |
bde |
Switch context before doing some i/o operations that might block if context would be switched on return to user mode. This fixes some denial of service problems.
|
#
42453 |
|
09-Jan-1999 |
eivind |
KNFize, by bde.
|
#
42408 |
|
08-Jan-1999 |
eivind |
Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers.
Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic.
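The shape of the new macro as described above; the assertion shown is illustrative only, with hypothetical names:

    /* Compiled in only with INVARIANTS; panics with the message on failure. */
    KASSERT(ptr != NULL, ("foo_alloc: ptr is NULL (size %d)", size));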
Reviewed by: msmith
|
#
38068 |
|
04-Aug-1998 |
phk |
Remove nonsense code. PR: 7482 Reviewed by: phk Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
|
#
37649 |
|
15-Jul-1998 |
bde |
Cast pointers to uintptr_t/intptr_t instead of to u_long/long, respectively. Most of the longs should probably have been u_longs, but this change is just to prevent warnings about casts between pointers and integers of different sizes, not to fix poorly chosen types.
|
#
36735 |
|
07-Jun-1998 |
dfr |
This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days time.
|
#
33134 |
|
06-Feb-1998 |
eivind |
Back out DIAGNOSTIC changes.
|
#
33109 |
|
05-Feb-1998 |
dyson |
1) Start using a cleaner and more consistent page allocator instead of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables.
4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
|
#
33108 |
|
04-Feb-1998 |
eivind |
Turn DIAGNOSTIC into a new-style option.
|
#
32702 |
|
22-Jan-1998 |
dyson |
VM level code cleanups.
1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneeded collapse operations, and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
|
#
32286 |
|
06-Jan-1998 |
dyson |
Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does.
When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex.
Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics as the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.
|
#
31853 |
|
19-Dec-1997 |
dyson |
Some performance improvements, and code cleanups (including changing our expensive OFF_TO_IDX to btoc whenever possible.)
|
#
30281 |
|
10-Oct-1997 |
phk |
Rename "struct kmemstats" to "struct malloc_type" it makes more sense now.
Fix type argument to hashinit() and phashinit()
|
#
29041 |
|
02-Sep-1997 |
bde |
Removed unused #includes.
|
#
26205 |
|
27-May-1997 |
alex |
Define NPRIMES in terms of the number of elements in 'primes' (as opposed to hardcoding it).
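The change in macro form, per the commit:

    /* Derive the count from the array instead of hardcoding it. */
    #define NPRIMES (sizeof(primes) / sizeof(primes[0]))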
|
#
22975 |
|
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
#
21673 |
|
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
8364 |
|
08-May-1995 |
dg |
Improved hashinit panic strings.
|
#
8177 |
|
30-Apr-1995 |
dg |
Added UIO_NOCOPY case to ureadc() to quiet gcc.
|
#
7683 |
|
08-Apr-1995 |
dg |
Added a few more entries to the list of prime numbers.
|
#
7611 |
|
04-Apr-1995 |
dg |
kern_subr.c: Added a new type to uiomove - "UIO_NOCOPY" which causes it to update pointers and counts, but doesn't do any data copying. This is needed for upcoming changes to the way that the vnode pager does its page outs. Added a new hash init function call "phashinit" that allocates and initializes a prime number sized hash table.
vfs_cache.c: Changed hashing algorithm to use the remainder of dividing by a prime number to improve the distribution characteristics. Uses the new phashinit function in kern_subr.c.
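A hedged sketch of the new allocator; the malloc-type argument follows the modern prototype, not the 1995 one:

    u_long nbuckets;
    void *hashtbl;

    /* Allocates a table whose size is a prime near the request. */
    hashtbl = phashinit(128, M_TEMP, &nbuckets);
    /* index = key % nbuckets; a prime modulus improves distribution. */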
|
#
6324 |
|
12-Feb-1995 |
dg |
Minor code improvement to uiomove()
|
#
1817 |
|
02-Aug-1994 |
dg |
Added $Id$
|
#
1549 |
|
25-May-1994 |
rgrimes |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
|
#
1542 |
|
24-May-1994 |
rgrimes |
This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
|
#
1541 |
|
24-May-1994 |
rgrimes |
BSD 4.4 Lite Kernel Sources
|