History log of /freebsd-10.0-release/sys/vm/
Revision Date Author Comments
267017 03-Jun-2014 delphij

Fix sendmail improper close-on-exec flag handling. [SA-14:11]

Fix incorrect error handling in PAM policy parser. [SA-14:13]

Fix triple-fault when executing from a threaded process. [EN-14:06]

Approved by: so

260122 31-Dec-2013 kib

MFC r259951:
Do not coalesce stack entry. Pass MAP_STACK_GROWS_DOWN and
MAP_STACK_GROWS_UP flags to vm_map_insert() from vm_map_stack()

Approved by: re (delphij)

259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

258911 04-Dec-2013 rodrigc

MFC r258737

In keg_dtor(), print out the keg name in the "Freed UMA keg was not empty"
message printed to the console. This makes it easier to track down
the source of certain memory leaks.

Suggested by: adrian
Approved by: re (gjb)


258037 12-Nov-2013 kib

MFC r257680:
Do not coalesce if the swap object belongs to tmpfs vnode.

Approved by: re (glebius)


256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


256275 10-Oct-2013 alc

Tidy up the output of "sysctl vm.phys_free".

Approved by: re (glebius)
Sponsored by: EMC / Isilon Storage Division


255793 22-Sep-2013 alc

Both the vm_map and vmspace zones are defined as "no free". So, there is no
point in defining a fini function for these zones.

Reviewed by: kib
Approved by: re (glebius)
Sponsored by: EMC / Isilon Storage Division


255732 20-Sep-2013 neel

Merge the following changes from projects/bhyve_npt_pmap:
- add fields to 'struct pmap' that are required to manage nested page tables.
- add a parameter to 'vmspace_alloc()' that can be used to override the
default pmap initialization routine 'pmap_pinit()'.

These changes are pushed ahead of the remaining changes in 'bhyve_npt_pmap'
in anticipation of the upcoming KBI freeze for 10.0.

Reviewed by: kib@, alc@
Approved by: re (glebius)


255724 20-Sep-2013 alc

The pmap function pmap_clear_reference() is no longer used. Remove it.

pmap_clear_reference() has had exactly one caller in the kernel for
several years, more precisely, since FreeBSD 8. Now, that call no
longer exists.

Approved by: re (kib)
Sponsored by: EMC / Isilon Storage Division


255708 19-Sep-2013 jhb

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatibility.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


255626 17-Sep-2013 kib

PG_SLAB no longer serves a useful purpose, since m->object is no
longer abused to store pointer to slab. Remove it.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (hrs)


255608 16-Sep-2013 kib

Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
POSIX shared memory file descriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)


255566 14-Sep-2013 kib

If the last page of the file is partially full and whole valid
portion is invalidated, invalidate the whole page. Otherwise,
partially valid page appears on a page queue, which is wrong. This
could only happen for the last page, because only then buffer which
triggered invalidation could not cover the whole page.

Reported and tested by: pho (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)
MFC after: 2 weeks


255497 12-Sep-2013 jhb

Fix an off-by-one error when populating mincore(2) entries for
skipped entries. lastvecindex references the last valid byte,
so the new bytes should come after it.

Approved by: re (kib)
MFC after: 1 week
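
The off-by-one described above can be illustrated with a small sketch. The helper name `sk_fill_skipped` and its parameters are hypothetical and this is not the kernel's code; it only assumes, as the message states, that `lastvecindex` is the index of the last byte already written, so zero-filling has to start one past it.

```c
#include <assert.h>

/*
 * Hypothetical helper: zero the result bytes for pages that were skipped
 * between the last byte already written (lastvecindex) and the next byte
 * about to be written (vecindex). Returns the new last valid index.
 */
static long
sk_fill_skipped(unsigned char *vec, long lastvecindex, long vecindex)
{
    assert(lastvecindex < vecindex);
    /* lastvecindex is already valid, so the first new byte is the next one. */
    while (lastvecindex + 1 < vecindex) {
        ++lastvecindex;
        vec[lastvecindex] = 0;
    }
    return (lastvecindex);
}
```

Starting the fill at `lastvecindex` itself would overwrite a byte that already holds a valid result, which is the kind of error the commit describes.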


255426 09-Sep-2013 jhb

Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by: alc
Approved by: re (kib)


255396 08-Sep-2013 kib

Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)


255244 05-Sep-2013 kib

The vm_page_trysbusy() should not fail when shared busy counter or
VPB_BIT_WAITERS flag were changed between reading of busy_lock and the
cas. The vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and does not mean that the current page state
prevents sbusying.

Retry the operation inside vm_page_trysbusy() if cas failed, only
return a failure when VPB_BIT_SHARED is cleared.

Reported and tested by: pho
Reviewed by: attilio
Sponsored by: The FreeBSD Foundation
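
The retry behaviour can be sketched with C11 atomics. This is a minimal model, not the kernel's implementation: the bit values and the names `sk_trysbusy`, `SK_BIT_SHARED`, `SK_BIT_WAITERS` and `SK_ONE_SHARER` are hypothetical. It shows a compare-and-swap loop that retries on a transient failure and gives up only when the shared bit is clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of the busy word, for illustration only. */
#define SK_BIT_SHARED  0x1u   /* lock is in shared (or unlocked) mode */
#define SK_BIT_WAITERS 0x4u   /* some thread is sleeping on the lock */
#define SK_ONE_SHARER  0x8u   /* increment for one shared holder */

static bool
sk_trysbusy(_Atomic uint32_t *busy_lock)
{
    uint32_t x;

    assert(busy_lock != NULL);
    x = atomic_load(busy_lock);
    for (;;) {
        /* Only an exclusively held lock is a real failure. */
        if ((x & SK_BIT_SHARED) == 0)
            return (false);
        /* On failure the CAS reloads x, so the loop simply retries. */
        if (atomic_compare_exchange_weak(busy_lock, &x, x + SK_ONE_SHARER))
            return (true);
    }
}
```

A concurrent change to the sharer count or the waiters bit makes the compare-and-swap fail, but the next iteration sees the fresh value and succeeds as long as the shared bit is still set.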


255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
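
The same sentinel technique can be tried in isolation. The sketch below is not the Capsicum API: `sk_rights_or` and `sk_rights_or_impl` are hypothetical names, and the worker simply ORs its arguments together until it reaches the `0ULL` terminator that the macro appends.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical worker: ORs arguments together until the 0ULL sentinel. */
static unsigned long long
sk_rights_or_impl(unsigned long long first, ...)
{
    va_list ap;
    unsigned long long acc = 0, r;

    va_start(ap, first);
    for (r = first; r != 0; r = va_arg(ap, unsigned long long))
        acc |= r;
    va_end(ap);
    return (acc);
}

/* The macro appends the terminator so callers never have to. */
#define sk_rights_or(...) sk_rights_or_impl(__VA_ARGS__, 0ULL)
```

Every variadic argument must really have type `unsigned long long` (hence the `ULL` suffixes), because `va_arg` reads them back with that type.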

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation
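
The bit layout described in this message (two version bits, then five one-hot index bits, leaving 64 - 2 - 5 = 57 right bits per element, hence 2 x 57 = 114 and 5 x 57 = 285) can be sketched as follows. The `SK_` names and `sk_right_index()` are hypothetical, and the exact shift is an assumption derived from that description, not a copy of the system header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 57..61 hold a one-hot array index, bits 62..63 the version. */
#define SK_IDX_SHIFT 57
#define SK_IDX_MASK  (0x1fULL << SK_IDX_SHIFT)
#define SK_CAPRIGHT(idx, bit) ((1ULL << (SK_IDX_SHIFT + (idx))) | (bit))

/* Right values quoted in the message, re-expressed with the sketch macro. */
#define SK_CAP_LOOKUP SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_FCHMOD SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define SK_CAP_PDKILL SK_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Returns the array index encoded in a right, or -1 when no index bit or
 * more than one index bit is set (rights from different elements mixed).
 */
static int
sk_right_index(uint64_t right)
{
    uint64_t idx = (right & SK_IDX_MASK) >> SK_IDX_SHIFT;
    int i;

    /* Exactly one index bit must be set. */
    if (idx == 0 || (idx & (idx - 1)) != 0)
        return (-1);
    for (i = 0; (idx & 1) == 0; i++)
        idx >>= 1;
    return (i);
}
```

ORing `SK_CAP_FCHMOD` with `SK_CAP_LOOKUP` keeps a single index bit, so the alias is accepted, while ORing `SK_CAP_LOOKUP` with `SK_CAP_PDKILL` sets two index bits and is rejected.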


255097 31-Aug-2013 mckusick

Fix bug introduced in rewrite of keg_free_slab in -r251894.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.

I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.

No MFC needed as this code exists only in HEAD.

Reviewed by: kib, jeff
Tested by: pho


255028 29-Aug-2013 alc

Significantly reduce the cost, i.e., run time, of calls to madvise(...,
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).

Since our malloc(3) makes heavy use of madvise(2), this change can have a
measurable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.

Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().

Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division


254911 26-Aug-2013 glebius

Remove comment that is no longer relevant since r254182.


254719 23-Aug-2013 alc

Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop(). Detect this scenario so that the vnode's hold count is
correctly maintained. Otherwise, we panic.

Reported by: scottl
Tested by: pho
Discussed with: attilio, jeff, kib


254667 22-Aug-2013 kib

Revert r254501. Instead, reuse the type stability of the struct pmap
which is the part of struct vmspace, allocated from UMA_ZONE_NOFREE
zone. Initialize the pmap lock in the vmspace zone init function, and
remove pmap lock initialization and destruction from pmap_pinit() and
pmap_release().

Suggested and reviewed by: alc (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation


254649 22-Aug-2013 kib

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


254622 21-Aug-2013 jeff

- Eliminate the vm object lock from the active queue scan. It is not
necessary since we do not free or cache the page from active anymore.
Document the one possible race that is harmless.

Sponsored by: EMC / Isilon Storage Division
Discussed with: alc


254599 21-Aug-2013 alc

Addendum to r254141: Allow recursion on the free pages queues lock in
vm_page_alloc_freelist().

Reported and tested by: sbruno
Sponsored by: EMC / Isilon Storage Division


254544 20-Aug-2013 jeff

- Increase the active lru refresh interval to 10 minutes. This has been
shown to negatively impact some workloads and the goal is only to
eliminate worst case behaviors for very long periods of paging
inactivity. Eventually we should determine a more complex scaling
factor for this feature.
- Rate limit low memory callback handlers to limit thrashing. Set the
default to 10 seconds.

Sponsored by: EMC / Isilon Storage Division


254543 19-Aug-2013 jeff

- Use an arbitrary but reasonably large import size for kva on architectures
that don't support superpages. This keeps the number of spans and internal
fragmentation lower.
- When the user asks for alignment from vmem_xalloc adjust the imported size
by 2*align to be certain we can satisfy the allocation. This comes at
the expense of potential failures when the backend can't supply enough
memory but could supply the requested size and alignment.

Sponsored by: EMC / Isilon Storage Division


254439 17-Aug-2013 kib

Remove the arbitrary binding of the pagedaemon threads to the domains,
update the comment accordingly and make it more precise.

Requested and reviewed by: jeff (previous version)


254430 16-Aug-2013 jhb

Add new mmap(2) flags to permit applications to request specific virtual
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests for n >= number of bits in a pointer or less than the size of
a page fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.

Reviewed by: alc
MFC after: 1 month
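
The validity rule for MAP_ALIGNED(n) can be written down as a small check. The function name `sk_check_align_shift` and the `page_shift` parameter (log2 of the page size) are hypothetical; the sketch only encodes the two rejection conditions stated above and is not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical validator for an alignment exponent n, where the requested
 * alignment is (1 << n) bytes and page_shift is log2 of the page size.
 * Returns 0 when the request is acceptable and EINVAL otherwise.
 */
static int
sk_check_align_shift(unsigned int n, unsigned int page_shift)
{
    assert(page_shift < sizeof(void *) * CHAR_BIT);
    /* The exponent must be smaller than the number of bits in a pointer. */
    if (n >= sizeof(void *) * CHAR_BIT)
        return (EINVAL);
    /* An alignment smaller than a page is rejected as well. */
    if (n < page_shift)
        return (EINVAL);
    return (0);
}
```

With 4KB pages (`page_shift` of 12), an exponent of 12 or 21 passes, while 11 fails because it is below the page size.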


254387 15-Aug-2013 jeff

- Fix bug in r254304. Use the ACTIVE pq count for the active list
processing, not inactive. This was the result of a bad merge.

Reported by: pho
Sponsored by: EMC / Isilon Storage Division


254362 15-Aug-2013 attilio

On the recovery path for vm_page_alloc(), if a page had been requested
wired, unwind back the wiring bits otherwise we can end up freeing a
page that is considered wired.

Sponsored by: EMC / Isilon storage division
Reported by: alc


254307 13-Aug-2013 jeff

- Add a statically allocated memguard arena since it is needed very early
on.
- Pass the appropriate flags to vmem_xalloc() when allocating space for
the arena from kmem_arena.

Sponsored by: EMC / Isilon Storage Division


254304 13-Aug-2013 jeff

Improve pageout flow control to wakeup more frequently and do less work while
maintaining better LRU of active pages.

- Change v_free_target to include the quantity previously represented by
v_cache_min so we don't need to add them together everywhere we use them.
- Add a pageout_wakeup_thresh that sets the free page count trigger for
waking the page daemon. Set this 10% above v_free_min so we wakeup before
any phase transitions in vm users.
- Adjust down v_free_target now that we're willing to accept more pagedaemon
wakeups. This means we process fewer pages in one iteration as well,
leading to shorter lock hold times and less overall disruption.
- Eliminate vm_pageout_page_stats(). This was a minor variation on the
PQ_ACTIVE segment of the normal pageout daemon. Instead we now process
1 / vm_pageout_update_period pages every second. This causes us to visit
the whole active list every 60 seconds. Previously we would only maintain
the active LRU when we were short on pages which would mean it could be
woefully out of date.

Reviewed by: alc (slight variant of this)
Discussed with: alc, kib, jhb
Sponsored by: EMC / Isilon Storage Division
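
The two quantities in this message reduce to simple integer arithmetic. The sketch below uses hypothetical function names, and the specific integer formulas are assumptions chosen to match the description (a threshold 10% above v_free_min, and 1 / vm_pageout_update_period of the active pages per one-second pass).

```c
#include <assert.h>

/* Wakeup threshold 10% above v_free_min, using integer arithmetic. */
static unsigned int
sk_wakeup_thresh(unsigned int v_free_min)
{
    return ((v_free_min / 10) * 11);
}

/* Number of active-queue pages to visit in each one-second pass. */
static unsigned int
sk_pages_per_pass(unsigned int active_count, unsigned int update_period)
{
    assert(update_period > 0);
    return (active_count / update_period);
}
```

With a 60-second update period, an active queue of 600000 pages is scanned 10000 pages at a time, so the whole queue is visited once per minute.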


254228 11-Aug-2013 attilio

Correct the recovery logic in vm_page_alloc_contig:
what is really needed in this code snippet is that all the pages that
are already fully inserted get fully freed, while for the others the
object removal itself might be skipped, hence the object might be set to
NULL.

Sponsored by: EMC / Isilon storage division
Reported by: alc, kib
Reviewed by: alc


254182 10-Aug-2013 kib

Different consumers of the struct vm_page abuse pageq member to keep
additional information, when the page is guaranteed to not belong to a
paging queue. Usually, this results in a lot of type casts which make
reasoning about the code correctness harder.

Sometimes m->object is used instead of pageq, which could cause real
and confusing bugs if non-NULL m->object is leaked. See r141955 and
r253140 for examples.

Change the pageq member into a union containing explicitly-typed
members. Use them instead of type-punning or abusing m->object in x86
pmaps, uma and vm_page_alloc_contig().

Requested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
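
The idea of replacing type punning with explicitly typed union members can be sketched like this. The structure and member names (`sk_page`, `plinks`, `q`, `s`, `range`) are hypothetical and do not reproduce the real struct vm_page; the sketch only shows the pattern.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical page structure: the queue linkage shares storage with
 * explicitly typed alternatives, which are only meaningful while the
 * page is known not to be on any paging queue.
 */
struct sk_page {
    union {
        struct {                    /* paging-queue linkage */
            struct sk_page *next;
            struct sk_page *prev;
        } q;
        struct {                    /* private data of an allocator */
            void *slab;
        } s;
        struct {                    /* private data of another consumer */
            unsigned long start;
            unsigned long size;
        } range;
    } plinks;
    void *object;                   /* never borrowed for private data */
};

/* Typed accessors replace casts of the linkage field. */
static void
sk_set_slab(struct sk_page *m, void *slab)
{
    assert(m != NULL);
    m->plinks.s.slab = slab;
}

static void *
sk_get_slab(const struct sk_page *m)
{
    assert(m != NULL);
    return (m->plinks.s.slab);
}
```

Because each consumer has its own named member, the compiler checks the types, and `object` stays NULL for pages that do not belong to an object.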


254168 09-Aug-2013 zont

Remove unused definition for CTL_VM_NAMES.

Suggested by: bde


254163 09-Aug-2013 jhb

Revert the addition of VPO_BUSY and instead update vm_page_replace() to
properly unbusy the page.

Submitted by: alc


254150 09-Aug-2013 obrien

Add missing 'VPO_BUSY' from r254141 to fix kernel build break.


254141 09-Aug-2013 attilio

On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

To do so, allow the operations derived from vm_radix_insert()
to fail, and handle all the resulting failures.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


254138 09-Aug-2013 attilio

The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the 2 concepts into a real, minimal sxlock where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


254065 07-Aug-2013 kib

Split the pagequeues per NUMA domain, and split the pagedaemon process
into threads each processing queue in a single domain. The structure
of the pagedaemons and queues is kept intact, most of the changes come
from the need for code to find an owning page queue for given page,
calculated from the segment containing the page.

The tie between NUMA domain and pagedaemon thread/pagequeue split is
rather arbitrary, the multithreaded daemon could be allowed for the
single-domain machines, or one domain might be split into several page
domains, to further increase concurrency.

Right now, each pagedaemon thread tries to reach the global target,
precalculated at the start of the pass. This is not optimal, since it
could cause excessive page deactivation and freeing. The code should
be changed to re-check the global page deficit state in the loop after
some number of iterations.

The pagedaemons reach the quorum before starting the OOM, since one
thread inability to meet the target is normal for split queues. Only
when all pagedaemons fail to produce enough reusable pages, OOM is
started by single selected thread.

Launder is modified to take into account the segments layout with
regard to the region for which cleaning is performed.

Based on the preliminary patch by jeff, sponsored by EMC / Isilon
Storage Division.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation


254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


254017 07-Aug-2013 markj

Fill in the description fields for M_FICT_PAGES.

Reviewed by: kib
MFC after: 3 days


253953 05-Aug-2013 attilio

Revert r253939:
We cannot busy a page before doing pagefaults.
In fact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by: EMC / Isilon storage division
Reported by: kib


253939 04-Aug-2013 attilio

The page hold mechanism is fast but it has a couple of drawbacks:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho


253775 29-Jul-2013 zont

Unbreak sysctl ABI changes introduced in r253662

Requested by: bde


253697 26-Jul-2013 jeff

Improve page LRU quality and simplify the logic.

- Don't short-circuit aging tests for unmapped objects. This biases
against unmapped file pages and transient mappings.
- Always honor PGA_REFERENCED. We can now use this after soft busying
to lazily restart the LRU.
- Don't transition directly from active to cached bypassing the inactive
queue. This frees recently used data much too early.
- Rename actcount to act_delta to be more consistent with use and meaning.

Reviewed by: kib, alc
Sponsored by: EMC / Isilon Storage Division


253662 26-Jul-2013 zont

Remove define and documentation for vm_pageout_algorithm missed in r253587


253636 25-Jul-2013 kientzle

Clear entire map structure including locks so that the
locks don't accidentally appear to have been already
initialized.

In particular, this fixes a consistent kernel crash on
armv6 with:
panic: lock "vm map (user)" 0xc09cc050 already initialized
that appeared with r251709.

PR: arm/180820


253604 24-Jul-2013 avg

rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST

Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
invoked depending on its relative order with scheduler; this was not obvious
and the bug actually used to exist

Reviewed by: kib (earlier version)
MFC after: 14 days


253591 24-Jul-2013 glebius

Since r251709 a slab no longer uses 8-bit indices to manage items,
thus remove a stale comment.

Reviewed by: jeff


253587 24-Jul-2013 jeff

- Remove the long obsolete 'vm_pageout_algorithm' experiment.

Discussed with: alc
Sponsored by: EMC / Isilon Storage Division


253583 23-Jul-2013 jeff

- Correct a stale comment. We don't have vclean() anymore. The work is
done by vgonel() and destroy_vobject() should only be called once from
VOP_INACTIVE().

Sponsored by: EMC / Isilon Storage Division


253565 23-Jul-2013 glebius

Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This
allows us to init counter zone at early stage of boot.

Reviewed by: kib
Tested by: Lytochkin Boris <lytboris gmail.com>


253556 22-Jul-2013 jlh

Fix previous commit when option RACCT is not used.

MFC after: 7 days


253554 22-Jul-2013 jlh

Fix a panic in the racct code when munlock(2) is called with incorrect values.

The racct code in sys_munlock() assumed that the boundaries provided by the
userland were correct as long as vm_map_unwire() returned successfully.
However the latter contains its own logic and sometimes manages to do something
out of those boundaries, even if they are buggy. This change makes the racct
code use the accounting done by the vm layer, as is done in other places
such as vm_mlock().

Despite fixing the panic, Alan Cox pointed out that this code is still racy:
two simultaneous callers will produce incorrect values.

Reviewed by: alc
MFC after: 7 days


253471 19-Jul-2013 jhb

Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
vm_map_find() that will try to alter the alignment of a mapping to match
any existing superpage mappings of the object being mapped. If no
suitable address range is found with the necessary alignment,
vm_map_find() will fall back to using the simple first-fit strategy
(VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by: alc (earlier version)
MFC after: 2 weeks


253221 11-Jul-2013 kib

When swap pager allocates metadata in the pagedaemon context, allow it
to drain the reserve. This was broken in r243040, causing deadlock.
Note that VM_WAIT call in case of uma_zalloc() failure from pagedaemon
would only wait for the v_pageout_free_min anyway.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation


253191 11-Jul-2013 kib

The vm_fault() should not be allowed to proceed on the map entry which
is being wired now. The entry wired count is changed to non-zero in
advance, before the map lock is dropped. This makes vm_fault()
perceive the entry as wired, and breaks the fragment which moves the
wire count from the shadowed page to the upper page, making the code
unwire a non-wired page.

On the other hand, the vm_fault() calls from vm_fault_wire() should be
allowed to proceed, so only drain MAP_ENTRY_IN_TRANSITION from
vm_fault() when wiring_thread is not current.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


253190 11-Jul-2013 kib

The mlockall() or VM_MAP_WIRE_HOLESOK does not interact properly with
parallel creation of the map entries, e.g. by mmap() or stack growing.
It also breaks when other entry is wired in parallel.

The vm_map_wire() iterates over the map entries in the region, and
assumes that map entries it finds are marked as in transition before,
also that any entry marked as in transition, are marked by the current
invocation of vm_map_wire(). This is not true for new entries in the
holes.

Add the thread owner of the MAP_ENTRY_IN_TRANSITION flag to struct
vm_map_entry. In vm_map_wire() and vm_map_unwire(), only process the
entries which transition owner is the current thread.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
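
The ownership rule can be modelled in a few lines. Everything here is a simplified sketch: the `sk_` types, the flag value, and the helper names are hypothetical, and only the `wiring_thread` idea (recording which thread set the in-transition mark) comes from the messages above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_ENTRY_IN_TRANSITION 0x1u   /* hypothetical flag value */

struct sk_thread {
    int id;
};

struct sk_map_entry {
    unsigned int flags;
    struct sk_thread *wiring_thread;  /* owner of the in-transition mark */
};

/* Mark an entry as in transition on behalf of thread td. */
static void
sk_mark_in_transition(struct sk_map_entry *e, struct sk_thread *td)
{
    assert((e->flags & SK_ENTRY_IN_TRANSITION) == 0);
    e->flags |= SK_ENTRY_IN_TRANSITION;
    e->wiring_thread = td;
}

/* Only entries marked by td itself should be processed by td. */
static bool
sk_transition_owned_by(const struct sk_map_entry *e,
    const struct sk_thread *td)
{
    return ((e->flags & SK_ENTRY_IN_TRANSITION) != 0 &&
        e->wiring_thread == td);
}
```

An entry created in a hole by another thread is either unmarked or marked with a different owner, so the wiring pass of the current thread skips it.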


253189 11-Jul-2013 kib

Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object
range which corresponds to the user-wired vm_map_entry, is always
fully populated.

Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour, use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


253188 11-Jul-2013 kib

In the vm_page_set_invalid() function, do not assert that the page is
not busy, since its only caller brelse() can legitimately call it on
busy page. This happens for VOP_PUTPAGES() on filesystems that use
buffers and which VOP_WRITE() method marked the buffer containing page
as non-cacheable.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


253095 09-Jul-2013 kib

Fix typo in comment.

MFC after: 3 days


252653 03-Jul-2013 neel

vm_phys_fictitious_reg_range() was losing the 'memattr' because it would be
reset by pmap_page_init() right after being initialized in vm_page_initfake().

The statement above is with reference to the amd64 implementation of
pmap_page_init().

Fix this by calling 'pmap_page_init()' in 'vm_page_initfake()' before changing
the 'memattr'.

Reviewed by: kib
MFC after: 2 weeks
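
The ordering problem is easy to reproduce in miniature. The sketch below is a model with hypothetical names (`sk_fake_page`, `sk_pmap_page_init`, `sk_page_initfake`), assuming only what the message says: the machine-dependent init resets the memory attribute, so it has to run before the caller's attribute is applied.

```c
#include <assert.h>

#define SK_MEMATTR_DEFAULT  0   /* hypothetical attribute values */
#define SK_MEMATTR_UNCACHED 1

struct sk_fake_page {
    int memattr;
    int initialized;
};

/* Models a machine-dependent init that resets memattr to the default. */
static void
sk_pmap_page_init(struct sk_fake_page *m)
{
    m->memattr = SK_MEMATTR_DEFAULT;
    m->initialized = 1;
}

/* Fixed order: initialize first, then apply the requested attribute. */
static void
sk_page_initfake(struct sk_fake_page *m, int memattr)
{
    sk_pmap_page_init(m);
    assert(m->initialized);
    m->memattr = memattr;
}
```

Swapping the two statements in `sk_page_initfake()` would leave every page with the default attribute, which mirrors the lost 'memattr' described above.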


252358 28-Jun-2013 davide

Remove a spurious keg lock acquisition.


252330 28-Jun-2013 jeff

- Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.

Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


252226 26-Jun-2013 jeff

- Resolve bucket recursion issues by passing a cookie with zone flags
through bucket_alloc() to uma_zalloc_arg() and uma_zfree_arg().
- Make some smaller buckets for large zones to further reduce memory
waste.
- Implement uma_zone_reserve(). This holds aside a number of items only
for callers who specify M_USE_RESERVE. Buckets will never be filled
from reserve allocations.

Sponsored by: EMC / Isilon Storage Division


252161 24-Jun-2013 glebius

Typo in comment.


252040 20-Jun-2013 jeff

- Add a per-zone lock for zones without kegs.
- Be more explicit about zone vs keg locking. This functionally changes
almost nothing.
- Add a size parameter to uma_zcache_create() so we can size the buckets.
- Pass the zone to bucket_alloc() so it can modify allocation flags
as appropriate.
- Fix a bug in zone_alloc_bucket() where I missed an address of operator
in a failure case. (Found by pho)

Sponsored by: EMC / Isilon Storage Division


251983 19-Jun-2013 jeff

- Persist the caller's flags in the bucket allocation flags so we don't
lose a M_NOVM when we recurse into a bucket allocation.

Sponsored by: EMC / Isilon Storage Division


251901 18-Jun-2013 des

Fix a bug that allowed a tracing process (e.g. gdb) to write
to a memory-mapped file in the traced process's address space
even if neither the traced process nor the tracing process had
write access to that file.

Security: CVE-2013-2171
Security: FreeBSD-SA-13:06.mmap
Approved by: so


251894 18-Jun-2013 jeff

Refine UMA bucket allocation to reduce space consumption and improve
performance.

- Always free to the alloc bucket if there is space. This gives LIFO
allocation order to improve hot-cache performance. This also allows
for zones with a single bucket per-cpu rather than a pair if the entire
working set fits in one bucket.
- Enable per-cpu caches of buckets. To prevent recursive bucket
allocation one bucket zone still has per-cpu caches disabled.
- Pick the initial bucket size based on a table driven maximum size
per-bucket rather than the number of items per-page. This gives
more sane initial sizes.
- Only grow the bucket size when we face contention on the zone lock, this
causes bucket sizes to grow more slowly.
- Adjust the number of items per-bucket to account for the header space.
This packs the buckets more efficiently per-page while making them
not quite powers of two.
- Eliminate the per-zone free bucket list. Always return buckets back
to the bucket zone. This ensures that as zones grow into larger
bucket sizes they eventually discard the smaller sizes. It persists
fewer buckets in the system. The locking is slightly trickier.
- Only switch buckets in zalloc, not zfree, this eliminates pathological
cases where we ping-pong between two buckets.
- Ensure that the thread that fills a new bucket gets to allocate from
it to give a better upper bound on allocation time.
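
The LIFO behaviour described in the first item can be illustrated with a
small userland sketch. This is not the UMA code: the names bucket_sketch,
bucket_free() and bucket_alloc(), and the capacity of 4, are hypothetical
and chosen only to show why freeing to the alloc bucket returns the most
recently freed (and likely cache-hot) item first.

```c
#include <assert.h>
#include <stddef.h>

#define BUCKET_SIZE 4	/* hypothetical capacity, not a UMA constant */

/* Hypothetical bucket: a fixed array used as a stack of free items. */
struct bucket_sketch {
	void *items[BUCKET_SIZE];
	int   cnt;
};

/* Free: push onto the bucket if there is space; returns 0 when full. */
static int
bucket_free(struct bucket_sketch *b, void *item)
{
	if (b->cnt == BUCKET_SIZE)
		return (0);
	b->items[b->cnt++] = item;
	return (1);
}

/* Alloc: pop the most recently freed item, or NULL when empty. */
static void *
bucket_alloc(struct bucket_sketch *b)
{
	if (b->cnt == 0)
		return (NULL);
	return (b->items[--b->cnt]);
}
```

With a single bucket used for both operations, an item freed a moment ago
is the next one handed out, which is the hot-cache property the commit
message refers to.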

Sponsored by: EMC / Isilon Storage Division


251826 17-Jun-2013 jeff

- Add a new UMA API: uma_zcache_create(). This makes a zone without any
backing memory that is only a container for per-cpu caches of arbitrary
pointer items. These zones have no kegs.
- Convert the regular keg based allocator to use the new import/release
functions.
- Move some stats to be atomics since they would require excessive zone
locking/unlocking with the new import/release paradigm. Make
zone_free_item simpler now that callers can manage more stats.
- Check for these cache-only zones in the public APIs and debugging
code by checking zone_first_keg() against NULL.

Sponsored by: EMC / Isilon Storage Division


251709 13-Jun-2013 jeff

- Convert the slab free item list from a linked array of indices to a
bitmap using sys/bitset. This is much simpler, has lower space
overhead and is cheaper in most cases.
- Use a second bitmap for invariants asserts and improve the quality of
the asserts as well as the number of erroneous conditions that we will
catch.
- Drastically simplify sizing code. Special case refcnt zones since they
will be going away.
- Update stale comments.
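
A minimal sketch of bitmap-based free item tracking follows. It assumes a
slab of at most 64 items and uses a plain uint64_t rather than sys/bitset;
the names slab_sketch, slab_init(), slab_alloc() and slab_free() are
hypothetical and do not appear in the UMA sources. The assertion in
slab_free() hints at how a bitmap makes double-free checks cheap.

```c
#include <assert.h>
#include <stdint.h>

#define SLAB_ITEMS 64	/* hypothetical: items per slab, one bit each */

/* Hypothetical slab header: bit i set means item i is free. */
struct slab_sketch {
	uint64_t free_map;
	int      free_count;
};

static void
slab_init(struct slab_sketch *s)
{
	s->free_map = ~(uint64_t)0;
	s->free_count = SLAB_ITEMS;
}

/* Return the index of a free item and mark it allocated, or -1 if full. */
static int
slab_alloc(struct slab_sketch *s)
{
	int i;

	for (i = 0; i < SLAB_ITEMS; i++) {
		if (s->free_map & ((uint64_t)1 << i)) {
			s->free_map &= ~((uint64_t)1 << i);
			s->free_count--;
			return (i);
		}
	}
	return (-1);
}

/* Mark item i free again; asserts catch out-of-range and double frees. */
static void
slab_free(struct slab_sketch *s, int i)
{
	assert(i >= 0 && i < SLAB_ITEMS);
	assert((s->free_map & ((uint64_t)1 << i)) == 0);
	s->free_map |= (uint64_t)1 << i;
	s->free_count++;
}
```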

Sponsored by: EMC / Isilon Storage Division


251591 10-Jun-2013 alc

Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by: EMC / Isilon Storage Division


251523 08-Jun-2013 glebius

Make sys_mlock() function just a wrapper around vm_mlock() function
that does all the job.

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


251471 06-Jun-2013 attilio

Complete r251452:
Avoid busying/unbusying a page in cases where there is no need to drop the
vm_obj lock, namely when the page is fully valid after
vm_page_grab().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


251397 04-Jun-2013 attilio

In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked, otherwise there is no reason to do so.
This eliminates the need to busy pages most of the time.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


251367 04-Jun-2013 alc

Update a comment.


251359 04-Jun-2013 alc

Relax the object locking in vm_pageout_map_deactivate_pages() and
vm_pageout_object_deactivate_pages(). A read lock suffices.

Sponsored by: EMC / Isilon Storage Division


251318 03-Jun-2013 kib

Remove irrelevant comments.

Discussed with: alc
MFC after: 3 days


251280 03-Jun-2013 alc

Require that the page lock is held, instead of the object lock, when
clearing the page's PGA_REFERENCED flag. Since we are typically
manipulating the page's act_count field when we are clearing its
PGA_REFERENCED flag, the page lock is already held everywhere that we clear
the PGA_REFERENCED flag. So, in fact, this revision only changes some
comments and an assertion. Nonetheless, it will enable later changes to
object locking in the pageout code.

Introduce vm_page_assert_locked(), which completely hides the implementation
details of the page lock from the caller, and use it in
vm_page_aflag_clear(). (The existing vm_page_lock_assert() could not be
used in vm_page_aflag_clear().) Over the coming weeks, I expect that we'll
either eliminate or replace the various uses of vm_page_lock_assert() with
vm_page_assert_locked().

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


251229 01-Jun-2013 alc

Now that access to the page's "act_count" field is synchronized by the page
lock instead of the object lock, there is no reason for vm_page_activate()
to assert that the object is locked for either read or write access.
(The "VPO_UNMANAGED" flag never changes after page allocation.)

Sponsored by: EMC / Isilon Storage Division


251183 31-May-2013 alc

Simplify the definition of vm_page_lock_assert(). There is no compelling
reason to inline the implementation of vm_page_lock_assert() in the
!KLD_MODULES case. Use the same implementation for both KLD_MODULES and
!KLD_MODULES.

Reviewed by: kib


251151 30-May-2013 kib

After the object lock was dropped, the object's reference count could
change. Retest the ref_count and return from the function if it is no
longer 1, so that the subsequent code, which assumes ref_count == 1, is
not executed. Also, do not leak the vnode lock if another thread cleared
the OBJ_TMPFS flag in the meantime.

Reported by: bdrewery
Tested by: bdrewery, pho
Sponsored by: The FreeBSD Foundation


251150 30-May-2013 kib

Remove the capitalization in the assertion message. Print the address
of the object to get useful information from optimized kernel dumps.


251077 28-May-2013 attilio

o Change the locking scheme for swp_bcount.
It can now be accessed with a write lock on the object containing it OR
with a read lock on the object containing it along with the swhash_mtx.
o Remove some duplicate assertions for swap_pager_freespace() and
swap_pager_unswapped() but keep the object locking references for
documentation.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


250909 22-May-2013 attilio

Acquire read lock on the src object for vm_fault_copy_entry().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


250884 21-May-2013 attilio

o Relax locking assertions for vm_page_find_least()
o Relax locking assertions for pmap_enter_object() and add them also
to architectures that currently don't have any
o Introduce VM_OBJECT_LOCK_DOWNGRADE() which is basically a downgrade
operation on the per-object rwlock
o Use all the mechanisms above to make vm_map_pmap_enter() work with
read locks only most of the time.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


250849 21-May-2013 kib

Add ddb command 'show pginfo' which provides useful information about
a vm page, denoted either by an address of the struct vm_page, or, if
the '/p' modifier is specified, by a physical address of the
corresponding frame.

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


250748 17-May-2013 alc

Relax the object locking in vm_fault_prefault(). A read lock suffices.

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


250745 17-May-2013 alc

Relax the object locking assertion in vm_page_lookup(). Now that a radix
tree is used to maintain the object's collection of resident pages,
vm_page_lookup() no longer needs an exclusive lock.

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


250601 13-May-2013 attilio

o Add accessor functions to add and remove pages from a specific
freelist.
o Split the pool of free pages queues really by domain and not rely on
definition of VM_RAW_NFREELIST.
o For MAXMEMDOM > 1, wrap the RR allocation logic into a specific
function that is called when calculating the allocation domain.
The RR counter is kept, currently, per-thread.
In the future this function is expected to evolve into a real
policy decision referee, based on specific information retrieved from
per-thread and per-vm_object attributes.
o Add the concept of "probed domains" in the form of vm_ndomains.
It is the responsibility of every architecture willing to support multiple
memory domains to correctly probe vm_ndomains along with the mem_affinity
segment attributes. Those two values are supposed to always remain
consistent.
Please also note that vm_ndomains and td_dom_rr_idx are both int
because segments already store domains as int. Ideally u_int would
make much more sense. Probably this should be cleaned up in the
future.
o Apply RR domain selection also to vm_phys_zero_pages_idle().
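
The per-thread round-robin selection described above can be sketched as
follows. This is an illustration only: thread_sketch and rr_pick_domain()
are hypothetical names, and the real code operates on struct thread's
td_dom_rr_idx and vm_ndomains rather than on function arguments.

```c
#include <assert.h>

/* Hypothetical per-thread state: only the round-robin index is modeled. */
struct thread_sketch {
	int dom_rr_idx;
};

/* Pick the next domain for this thread, cycling through 0..ndomains-1. */
static int
rr_pick_domain(struct thread_sketch *td, int ndomains)
{
	int domain;

	assert(ndomains > 0);
	domain = td->dom_rr_idx % ndomains;
	td->dom_rr_idx = (domain + 1) % ndomains;
	return (domain);
}
```

Keeping the counter per thread means that successive allocations by one
thread are spread across domains without any shared counter to contend on.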

Sponsored by: EMC / Isilon storage division
Partly obtained from: jeff
Reviewed by: alc
Tested by: jeff


250594 13-May-2013 peter

Bandaid for compiling with gcc, which happens to be the default compiler
for a number of platforms still.


250577 12-May-2013 alc

Refactor vm_page_alloc()'s interactions with vm_reserv_alloc_page() and
vm_page_insert() so that (1) vm_radix_lookup_le() is never called while the
free page queues lock is held and (2) vm_radix_lookup_le() is called at most
once. This change reduces the average time that the free page queues lock
is held by vm_page_alloc() as well as vm_page_alloc()'s average overall
running time.

Sponsored by: EMC / Isilon Storage Division


250520 11-May-2013 alc

To reduce the amount of arithmetic performed in the various radix tree
functions, reverse the numbering scheme for the levels. The highest
numbered level in the tree now appears near the root instead of the leaves.

Sponsored by: EMC / Isilon Storage Division


250361 08-May-2013 attilio

Fix-up r250338 by completing the removal of VM_NDOMAIN in favor of
MAXMEMDOM.
This unbreaks builds.

Sponsored by: EMC / Isilon storage division
Reported by: adrian, jeli


250338 07-May-2013 attilio

Rename VM_NDOMAIN into MAXMEMDOM and move it into machine/param.h in
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.

Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc


250334 07-May-2013 alc

Remove a redundant call to panic() from vm_radix_keydiff(). The assertion
before the loop accomplishes the same thing.

Sponsored by: EMC / Isilon Storage Division


250259 04-May-2013 alc

Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically,
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node. Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.

Sponsored by: EMC / Isilon Storage Division
Reviewed by: attilio
Tested by: pho


250219 03-May-2013 jhb

Fix two bugs in the current NUMA-aware allocation code:
- vm_phys_alloc_freelist_pages() can be called by vm_page_alloc_freelist()
to allocate a page from a specific freelist. In the NUMA case it did not
properly map the public VM_FREELIST_* constants to the correct backing
freelists, nor did it try all NUMA domains for allocations from
VM_FREELIST_DEFAULT.
- vm_phys_alloc_pages() did not pin the thread and each call to
vm_phys_alloc_freelist_pages() fetched the current domain to choose
which freelist to use. If a thread migrated domains during the loop
in vm_phys_alloc_pages() it could skip one of the freelists. If the
other freelists were out of memory then it is possible that
vm_phys_alloc_pages() would fail to allocate a page even though pages
were available resulting in a panic in vm_page_alloc().

Reviewed by: alc
MFC after: 1 week


250187 02-May-2013 kib

Add a hint suggesting why tmpfs does not need a special case there.


250030 28-Apr-2013 kib

Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use by up to 2x when mapping
files from tmpfs, it also makes tmpfs read and write operations copy
half as many bytes.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month


250029 28-Apr-2013 kib

Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's
v_object of non-OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that object type must
be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such
objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


250028 28-Apr-2013 kib

Assert that the object type for the vnode's non-NULL v_object, passed
to vnode_pager_setsize(), is either OBJT_VNODE, or, if the vnode was
already reclaimed, OBJT_DEAD. Note that the latter is only possible
because some filesystems, in particular nfsiods from nfs clients, call
vnode_pager_setsize() with an unlocked vnode.

Moreover, if the object is terminated, do not perform the resizing
operation.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


250026 28-Apr-2013 kib

Convert panic() into KASSERT().

Reviewed by: alc
MFC after: 1 week


250018 28-Apr-2013 alc

Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le().
This call is clearing bits from the key that will be set again by the next
line.

Sponsored by: EMC / Isilon Storage Division


249986 27-Apr-2013 alc

Avoid some lookup restarts in vm_radix_lookup_{ge,le}().

Sponsored by: EMC / Isilon Storage Division


249763 22-Apr-2013 glebius

Panic if UMA_ZONE_PCPU is created at early stages of boot, when mp_ncpus
isn't yet initialized. Otherwise we will panic at first allocation later.

Sponsored by: Nginx, Inc.


249745 22-Apr-2013 alc

Simplify vm_radix_{add,dec}lev().

Sponsored by: EMC / Isilon Storage Division


249605 18-Apr-2013 alc

When calculating the number of reserved nodes, discount the pages that will
be used to store the nodes.

Sponsored by: EMC / Isilon Storage Division


249502 15-Apr-2013 alc

Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child. This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes. Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions. For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.

Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division


249427 12-Apr-2013 alc

Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child. This change is the first step in
eliminating those unnecessary level zero interior nodes. Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.

Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division


249313 09-Apr-2013 glebius

Convert UMA code to C99 uintXX_t types.


249312 09-Apr-2013 glebius

Swap us_freecount and us_flags, achieving same structure size
as before previous commit.

Submitted by: alc


249309 09-Apr-2013 glebius

Since now we support 256 items per slab, we need more bits
for us_freecount.

This grows uma_slab_head on 32-bit arches, but growth isn't
significant. Taking kmem zones as example, only the 32 byte
zone is affected, ipers is reduced from 113 to 112.

In collaboration with: kib


249305 09-Apr-2013 glebius

Fix KASSERTs: maximum number of items per slab is 256.


249303 09-Apr-2013 kib

Fix the assertions for the state of the object under the map entry
with the MAP_ENTRY_VN_WRITECNT flag:
- Move the assertion that verifies the state of the v_writecount and
vnp.writecount, under the block where the object is locked.
- Check that the object type is OBJT_VNODE before asserting.

Reported by: avg
Reviewed by: alc
MFC after: 1 week


249278 08-Apr-2013 attilio

The per-page act_count can very easily be protected by the
per-page lock rather than the vm_object lock, without any further overhead.
Make the formal switch.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


249264 08-Apr-2013 glebius

Merge from projects/counters: UMA_ZONE_PCPU zones.

These zones have slab size == sizeof(struct pcpu), but request from VM
enough pages to fit (uk_slabsize * mp_ncpus). An item allocated from such
zone would have a separate twin for each CPU in the system, and these twins
are at a distance of sizeof(struct pcpu) from each other. This magic value
of distance would allow us to make some optimizations later.

To address a CPU's private item, simple arithmetic should be used:

item = (type *)((char *)base + sizeof(struct pcpu) * curcpu)

This arithmetic is available as the zpcpu_get() macro in pcpu.h.
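
A userland sketch of the same address computation is given here for
illustration. The struct pcpu_sketch type (with an arbitrary size of 256
bytes) and the pcpu_item() helper are hypothetical stand-ins; in the kernel
the stride is sizeof(struct pcpu) and the CPU index is curcpu.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pcpu; only its size matters here. */
struct pcpu_sketch {
	char pad[256];
};

/* Address the copy of an item that belongs to CPU 'cpu'. */
static void *
pcpu_item(void *base, int cpu)
{
	return ((char *)base + sizeof(struct pcpu_sketch) * (size_t)cpu);
}
```

Because every CPU's copy sits at a fixed stride from the base, no lookup
table is needed to find it.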

To introduce non-page size slabs a new field, uk_slabsize, has been added
to uma_keg. This shifted some frequently used fields of uma_keg to the
fourth cache line on amd64. To mitigate this pessimization, uma_keg fields
were rearranged a bit, and the least frequently used uk_name and uk_link
were moved down to the fourth cache line. All other fields that are
dereferenced frequently fit into the first three cache lines.

Sponsored by: Nginx, Inc.


249221 07-Apr-2013 alc

Micro-optimize the order of struct vm_radix_node's fields. Specifically,
arrange for all of the fields to start at a short offset from the
beginning of the structure.

Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in
vm_radix_getroot().

Sponsored by: EMC / Isilon Storage Division


249218 06-Apr-2013 jeff

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


249211 06-Apr-2013 alc

Simplify vm_radix_keybarr().

Sponsored by: EMC / Isilon Storage Division


249182 06-Apr-2013 alc

Simplify vm_radix_insert().

Reviewed by: attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


249038 03-Apr-2013 alc

Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and
vm_radix_topage(). This transformation eliminates some unnecessary
conditional branches from the inner loops of vm_radix_insert(),
vm_radix_lookup{,_ge,_le}(), and vm_radix_remove().

Simplify the control flow of vm_radix_lookup_{ge,le}().

Reviewed by: attilio (an earlier version)
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


248815 28-Mar-2013 kib

Release the v_writecount reference on the vnode in case of error,
before the vnode is vput() in vm_mmap_vnode(). Error return means
that there is no use reference on the vnode from the vm object
reference, and failing to restore v_writecount breaks the invariant
that v_writecount is less or equal to the usecount.

The situation was observed when the nfs client returns ESTALE for
VOP_GETATTR() after the open.

In collaboration with: pho
MFC after: 1 week


248728 26-Mar-2013 alc

Introduce vm_radix_isleaf() and use it in a couple places. As compared to
using vm_radix_node_page() == NULL, the compiler is able to generate one
less conditional branch when vm_radix_isleaf() is used. More use cases
involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(),
and vm_radix_remove() will follow.
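
One common way to make an "is this slot a leaf?" test cheap is to tag the
pointer itself. The sketch below assumes a low-bit tag; the names
SKETCH_ISLEAF, sketch_toleaf(), sketch_isleaf() and sketch_topage() are
hypothetical, and this is not claimed to be how vm_radix_isleaf() is
implemented.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_ISLEAF ((uintptr_t)0x1)	/* hypothetical tag bit */

struct page_sketch {
	int dummy;
};

/* Tag a page pointer as a leaf; relies on the pointer's low bit being 0. */
static void *
sketch_toleaf(struct page_sketch *p)
{
	return ((void *)((uintptr_t)p | SKETCH_ISLEAF));
}

/* A single mask-and-test, with no comparison against a converted pointer. */
static int
sketch_isleaf(void *slot)
{
	return (((uintptr_t)slot & SKETCH_ISLEAF) != 0);
}

/* Strip the tag to recover the original page pointer. */
static struct page_sketch *
sketch_topage(void *slot)
{
	return ((struct page_sketch *)((uintptr_t)slot & ~SKETCH_ISLEAF));
}
```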

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


248684 24-Mar-2013 alc

Micro-optimize the control flow in a few places. Eliminate a panic call
that could never be reached in vm_radix_insert(). (If the pointer being
checked by the panic call were ever NULL, the immediately preceding loop
would have already crashed on a NULL pointer dereference.)

Reviewed by: attilio (an earlier version)
Sponsored by: EMC / Isilon Storage Division


248569 21-Mar-2013 kib

Only size and create the bio_transient_map when unmapped buffers are
enabled. Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by: The FreeBSD Foundation


248550 20-Mar-2013 kib

Fix the logic inversion in the r248512.

Noted by: mckay


248514 19-Mar-2013 kib

Do not map the swap i/o pbufs if the geom provider for the swap
partition accepts unmapped requests.

Sponsored by: The FreeBSD Foundation
Tested by: pho


248512 19-Mar-2013 kib

Pass unmapped buffers for page in requests if the filesystem indicated support
for the unmapped i/o.

Sponsored by: The FreeBSD Foundation
Tested by: pho


248508 19-Mar-2013 kib

Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of
unmapped buffers eliminates the need to perform TLB shootdown for
mapping on buffer creation and reuse, greatly reducing the number
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

An unmapped buffer must be explicitly requested by the consumer with the
GB_UNMAPPED flag. For an unmapped buffer, no KVA reservation is
performed at all. With the GB_KVAALLOC flag, the consumer may request an
unmapped buffer which does have a KVA reserve, so that it can be mapped
manually without recursing into the buffer cache and blocking.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

An unmapped buffer is translated into an unmapped bio in g_vfs_strategy().
An unmapped bio carries a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitly declare its readiness to accept unmapped bios,
otherwise the g_down geom thread performs a transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable; disabling it makes buffer (or cluster) creation
requests ignore the GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is no longer subject to the maxbufspace
limit. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, are
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is enforced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not work, because
buffer_map fragmentation does not allow the limit to be reached.

At Jeff Roberson's request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks


248449 18-Mar-2013 attilio

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allow the acquisition of read locks for lookup operations on the
resident/cached pages collections, as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie relies on the consumers' locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at most one new node per insert, which means the
maximum number of nodes needed is the number of available physical
frames themselves. However, not every insert really needs a new
bisection node.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes, broken down piecemeal, can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


248283 14-Mar-2013 kib

Some style fixes.

Sponsored by: The FreeBSD Foundation


248280 14-Mar-2013 kib

Add pmap function pmap_copy_pages(), which copies the content of the
pages around, taking array of vm_page_t both for source and
destination. Starting offsets and total transfer size are specified.

The function implements an optimal algorithm for copying using
platform-specific optimizations. For instance, on the architectures
where the direct map is available, no transient mappings are created;
for i386 the per-cpu ephemeral page frame is used. The code was
typically borrowed from the pmap_copy_page() for the same
architecture.

Only i386/amd64, powerpc aim and arm/arm-v6 implementations were
tested at the time of commit. High-level code, not committed yet to
the tree, ensures that the use of the function is only allowed after
explicit enablement.

For sparc64, the existing code has known issues and a stub is added
instead, to allow the kernel to link.

Sponsored by: The FreeBSD Foundation
Tested by: pho (i386, amd64), scottl (amd64), ian (arm and arm-v6)
MFC after: 2 weeks


248277 14-Mar-2013 kib

Remove excessive and inconsistent initializers for the various kernel
maps and submaps.

MFC after: 2 weeks


248197 12-Mar-2013 attilio

Simplify vm_page_is_valid().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


248117 09-Mar-2013 alc

Update a comment: The object lock is no longer a mutex.


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but a few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
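
For third-party code that still uses the retired names, the KPI mapping listed in this commit can be sketched as a small compat shim. This is only an illustration under stated assumptions: struct sk_vm_object and the two W-lock macros below are userland stand-ins so the sketch compiles outside the kernel, and sk_legacy_touch() is a hypothetical caller, not anything from the tree.

```c
/* Userland stand-ins (assumptions) so the sketch builds outside the kernel. */
struct sk_vm_object {
	int wlocked;	/* 1 while "write locked" */
};
#define VM_OBJECT_WLOCK(o)	((void)((o)->wlocked = 1))
#define VM_OBJECT_WUNLOCK(o)	((void)((o)->wlocked = 0))

/* Hypothetical compat shim: map the retired names onto write mode. */
#ifndef VM_OBJECT_LOCK
#define VM_OBJECT_LOCK(o)	VM_OBJECT_WLOCK(o)
#define VM_OBJECT_UNLOCK(o)	VM_OBJECT_WUNLOCK(o)
#endif

/* Legacy-style caller; returns the lock state seen inside the section. */
static int
sk_legacy_touch(struct sk_vm_object *o)
{
	int seen;

	VM_OBJECT_LOCK(o);
	seen = o->wlocked;
	VM_OBJECT_UNLOCK(o);
	return (seen);
}
```

Mapping the old names onto write mode keeps the old exclusive semantics; converting a caller to read mode would be a separate, deliberate change.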


248082 09-Mar-2013 attilio

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


248032 08-Mar-2013 andre

Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to standard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by: davide


247788 04-Mar-2013 attilio

Merge from vmcontention:
As vm objects are type-stable there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate() but this could be done in the init UMA zone
handler.

The UMA zone destructor handler will further check that the condition
still holds at every destruction, to catch bugs.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


247659 02-Mar-2013 alc

The value held by the vm object's field pg_color is only considered
valid if the flag OBJ_COLORED is set. Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.

Sponsored by: EMC / Isilon Storage Division


247602 02-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer a separate descriptor type. Now every descriptor
has its own set of capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If the CAP_IOCTL capability right is present we can further reduce the
list of allowed ioctls with the new cap_ioctls_limit(2) syscall. The list
of allowed ioctls can be retrieved with the cap_ioctls_get(2) syscall.

- If the CAP_FCNTL capability right is present we can further reduce the
fcntls that can be used with the new cap_fcntls_limit(2) syscall and
retrieve them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNOD to CAP_MKNODAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convenient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
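
The composite defines above are plain bit masks, so a rights check for pread(2) passes only when both CAP_SEEK and CAP_READ are held. A minimal userland sketch of that idea follows. The bit values are placeholders chosen for illustration (the log does not give them), and sk_cap_has_rights() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; not the kernel's values. */
#define SK_CAP_READ	0x1ULL
#define SK_CAP_WRITE	0x2ULL
#define SK_CAP_SEEK	0x4ULL

/* Composites, following the shape of the defines in the log. */
#define SK_CAP_PREAD	(SK_CAP_SEEK | SK_CAP_READ)
#define SK_CAP_PWRITE	(SK_CAP_SEEK | SK_CAP_WRITE)

/* Hypothetical helper: true only if every needed bit is held. */
static bool
sk_cap_has_rights(uint64_t held, uint64_t needed)
{
	return ((held & needed) == needed);
}
```

With only the read bit held, a check against SK_CAP_READ succeeds while a check against SK_CAP_PREAD fails, which matches the new CAP_READ behaviour described above.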


247400 27-Feb-2013 attilio

Merge from vmobj-rwlock:
VM_OBJECT_LOCKED() macro is only used to implement a custom version
of lock assertions right now (which likely spread out thanks to
copy and paste).
Remove it and implement actual assertions.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


247360 26-Feb-2013 attilio

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the necessary KVA space to cater the zone allocations and allocates pages
with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater the backend
allocator.
- uma_zone_reserve_kva() can cater M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() uses directly the direct-mapping
by uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init() as there is no need to export it anymore
and the calls are no longer homogeneous either: there are now small
differences between arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (which also offered almost all the comments)
Tested by: pho, jhb, davide


247346 26-Feb-2013 attilio

Remove white spaces.

Sponsored by: EMC / Isilon storage division


247323 26-Feb-2013 attilio

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the need to access the lock address directly.
For this reason VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


246926 18-Feb-2013 alc

On arm, like sparc64, the end of the kernel map varies from one type of
machine to another. Therefore, VM_MAX_KERNEL_ADDRESS can't be a constant.
Instead, #define it to be a variable, vm_max_kernel_address, just like we
do on sparc64.

Reviewed by: kib
Tested by: ian


246805 14-Feb-2013 jhb

Make VM_NDOMAIN a kernel option so that it can be enabled from a kernel
config file.

Requested by: phk (ages ago)
MFC after: 1 month


246316 04-Feb-2013 marius

Try to improve r242655 take III: move these SYSCTLs describing the kernel
map, which is defined and initialized in vm/vm_kern.c, to the latter.

Submitted by: alc


246087 29-Jan-2013 glebius

Fix typo in debug printf.


246032 28-Jan-2013 zont

- Add system wide page faults requiring I/O counter.

Reviewed by: alc
MFC after: 2 weeks


246030 28-Jan-2013 zont

- Add sysctls to show number of stats scans.

MFC after: 2 weeks


246029 28-Jan-2013 zont

- Style.

MFC after: 2 weeks


245421 14-Jan-2013 zont

- Get rid of unused function vmspace_wired_count().

Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 week


245296 11-Jan-2013 zont

- Improve readability of sys_obreak().

Suggested by: alc
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 week


245255 10-Jan-2013 zont

- Reduce kernel size by removing unnecessary pointer indirections.

GENERIC kernel size reduced by 16 bytes and RACCT kernel by 336 bytes.

Suggested by: alc
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 1 week


245226 09-Jan-2013 ken

Fix a bug in the device pager code that can trigger an assertion
in devfs if a particular race condition is hit in the device pager
code.

This was a side effect of change 227530 which changed the device
pager interface to call a new destructor routine for the cdev.
That destructor routine, old_dev_pager_dtor(), takes a VM object
handle.

The object handle is cast to a struct cdev *, and passed into
dev_rel().

That works in most cases, except the case in cdev_pager_allocate()
where there is a race condition between two threads allocating an
object backed by the same device. The loser of the race
deallocates its object at the end of the function.

The problem is that before inserting the object into the
dev_pager_object_list, the object's handle is changed from the
struct cdev pointer to the object's own address. This is to avoid
conflicts with the winner of the race, which already inserted an
object in the list with a handle that is a pointer to the same cdev
structure.

The object is then passed to vm_object_deallocate(), and eventually
makes its way down to old_dev_pager_dtor(). That function passes
the handle pointer (which is actually a VM object, not a struct
cdev as usual) into dev_rel(). dev_rel() decrements the reference
count in the assumed struct cdev (which happens to be 0), and
that triggers the assertion in dev_rel() that the reference count
is greater than or equal to 0.

The fix is to add a cdev pointer to the VM object, and use that
pointer when calling the cdev_pg_dtor() routine.

vm_object.h: Add a struct cdev pointer to the VM object
structure.

device_pager.c: In cdev_pager_allocate(), populate the new cdev
pointer.

In dev_pager_dealloc(), use the new cdev pointer
when calling the object's cdev_pg_dtor() routine.

Reviewed by: kib
Sponsored by: Spectra Logic Corporation
MFC after: 1 week


244532 21-Dec-2012 glebius

Comment fix: there is no ub_ptr, instead explain meaning of uz_count
field verbally.


244384 18-Dec-2012 zont

- Fix locked memory accounting for maps with MAP_WIREFUTURE flag.
- Add sysctl vm.old_mlock which may turn such accounting off.

Reviewed by: avg, trasz
Approved by: kib (mentor)
MFC after: 1 week


244043 09-Dec-2012 alc

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by: kib
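
The idea of testing a page attribute instead of enumerating object types can be sketched in userland as follows. All names here are hypothetical stand-ins prefixed with sk_ or SK_, and the flag's bit value is a placeholder; only the list of types with fictitious pages is taken from the commit message.

```c
#include <stdbool.h>

enum sk_objtype {
	SK_OBJT_DEFAULT,
	SK_OBJT_DEVICE,
	SK_OBJT_MGTDEVICE,
	SK_OBJT_SG
};

#define SK_OBJ_FICTITIOUS	0x1	/* placeholder bit value */

struct sk_object {
	enum sk_objtype type;
	int flags;
};

/* Set the attribute flag once, when the object is created. */
static struct sk_object
sk_object_create(enum sk_objtype type)
{
	struct sk_object o = { type, 0 };

	if (type == SK_OBJT_DEVICE || type == SK_OBJT_MGTDEVICE ||
	    type == SK_OBJT_SG)
		o.flags |= SK_OBJ_FICTITIOUS;
	return (o);
}

/* Callers test the attribute, not the list of types. */
static bool
sk_object_has_fictitious(const struct sk_object *o)
{
	return ((o->flags & SK_OBJ_FICTITIOUS) != 0);
}
```

Adding another object type then touches only the creation path; the attribute checks scattered through the callers stay unchanged.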


244024 08-Dec-2012 pjd

White-space cleanups.


243998 07-Dec-2012 pjd

Implemented uma_zone_set_warning(9) function that sets a warning, which
will be printed once the given zone becomes full and cannot allocate an
item. The warning will not be printed more often than every five minutes.

All UMA warnings can be globally turned off by setting sysctl/tunable
vm.zone_warnings to 0.

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
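
The rate limit described above (at most one warning per five minutes, with a global off switch) might look roughly like this in userland. This is a sketch, not the UMA code: struct sk_zone, its fields, and sk_zone_should_warn() are hypothetical, and time is passed in explicitly so the logic is easy to exercise.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define SK_WARN_INTERVAL	(5 * 60)	/* seconds */

struct sk_zone {
	const char *warning;	/* NULL means no warning configured */
	time_t last_warned;	/* valid only if warned_once is true */
	bool warned_once;
};

/* Returns true if the caller should print the zone's warning now. */
static bool
sk_zone_should_warn(struct sk_zone *z, time_t now, bool warnings_enabled)
{
	if (!warnings_enabled || z->warning == NULL)
		return (false);
	if (z->warned_once && now - z->last_warned < SK_WARN_INTERVAL)
		return (false);
	z->warned_once = true;
	z->last_warned = now;
	return (true);
}
```

The warnings_enabled argument plays the role of the vm.zone_warnings knob in this sketch.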


243659 28-Nov-2012 alc

Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr(). Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.

Reviewed by: kib
MFC after: 10 days


243529 25-Nov-2012 alc

Make a few small changes to vm_map_pmap_enter():

Add detail to the comment describing this function. In particular,
describe what MAP_PREFAULT_PARTIAL does.

Eliminate the abrupt change in behavior when the specified address range
grows from MAX_INIT_PT pages to MAX_INIT_PT plus one pages. Instead of
doing nothing, i.e., preloading no mappings whatsoever, map any resident
pages that fall within the start of the specified address range, i.e.,
[addr, addr + ulmin(size, ptoa(MAX_INIT_PT))).

Long ago, the vm object's list of resident pages was not ordered, so
this function had to choose between probing the global hash table of
all resident pages and iterating over the vm object's unordered list of
resident pages. Now, the list is ordered, so there is no reason for
MAP_PREFAULT_PARTIAL to be concerned with the vm object's count of
resident pages.

MFC after: 14 days
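
The clamped range [addr, addr + ulmin(size, ptoa(MAX_INIT_PT))) can be worked through with a small userland sketch. The page size and the value of the cap are assumptions made for illustration, and the sk_ helpers are hypothetical stand-ins for ulmin() and ptoa().

```c
#define SK_PAGE_SIZE	4096UL	/* assumed page size */
#define SK_MAX_INIT_PT	96UL	/* assumed cap, in pages */

static unsigned long
sk_ulmin(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

/* Exclusive end of the range that is eligible for preloaded mappings. */
static unsigned long
sk_prefault_end(unsigned long addr, unsigned long size)
{
	return (addr + sk_ulmin(size, SK_MAX_INIT_PT * SK_PAGE_SIZE));
}
```

A request one page larger than the cap therefore still preloads the first SK_MAX_INIT_PT pages instead of preloading nothing, which is the behaviour change the commit describes.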


243366 21-Nov-2012 alc

Correct an error in r230623. When both VM_ALLOC_NODUMP and VM_ALLOC_ZERO
were specified to vm_page_alloc(), PG_NODUMP wasn't being set on the
allocated page when it happened to be pre-zeroed.

MFC after: 5 days


243333 20-Nov-2012 jh

- Don't pass geom and provider names as format strings.
- Add __printflike() attributes.
- Remove an extra argument for the g_new_geomf() call in swapongeom_ev().

Reviewed by: pjd


243176 17-Nov-2012 alc

Update a comment to reflect the elimination of the hold queue in r242300.


243132 16-Nov-2012 kib

Move the declaration of vm_phys_paddr_to_vm_page() from vm/vm_page.h
to vm/vm_phys.h, where it belongs.

Requested and reviewed by: alc
MFC after: 2 weeks


243131 16-Nov-2012 kib

Explicitly state that M_USE_RESERVE requires M_NOWAIT, using assertion.

Reviewed by: alc
MFC after: 2 weeks


243040 14-Nov-2012 kib

Flip the semantic of M_NOWAIT to only require the allocation to not
sleep, and perform the page allocations with VM_ALLOC_SYSTEM
class. Previously, the allocation was also allowed to completely drain
the reserve of the free pages, being translated to VM_ALLOC_INTERRUPT
request class for vm_page_alloc() and similar functions.

Allow the caller of malloc* to request the 'deep drain' semantic by
providing M_USE_RESERVE flag, now translated to VM_ALLOC_INTERRUPT
class. Previously, it resulted in less aggressive VM_ALLOC_SYSTEM
allocation class.

Centralize the translation of the M_* malloc(9) flags in the single
inline function malloc2vm_flags().

Discussion started by: "Sears, Steven" <Steven.Sears@netapp.com>
Reviewed by: alc, mdf (previous version)
Tested by: pho (previous version)
MFC after: 2 weeks
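
The flag translation described above can be sketched in userland as follows. This is not the kernel's malloc2vm_flags(): the constants are placeholders with made-up values, the function name is prefixed with sk_ to mark it as hypothetical, and the assertion mirrors the M_USE_RESERVE/M_NOWAIT requirement stated in r243131.

```c
#include <assert.h>

/* Placeholder values; the real M_* and VM_ALLOC_* constants differ. */
#define SK_M_NOWAIT		0x01
#define SK_M_USE_RESERVE	0x02

#define SK_VM_ALLOC_SYSTEM	0x01
#define SK_VM_ALLOC_INTERRUPT	0x02

/* Translate malloc-style flags into a page allocation class. */
static int
sk_malloc2vm_flags(int malloc_flags)
{
	/* The deep-drain request is only valid for non-sleeping callers. */
	assert((malloc_flags & SK_M_USE_RESERVE) == 0 ||
	    (malloc_flags & SK_M_NOWAIT) != 0);

	return ((malloc_flags & SK_M_USE_RESERVE) != 0 ?
	    SK_VM_ALLOC_INTERRUPT : SK_VM_ALLOC_SYSTEM);
}
```

In this sketch a plain non-sleeping request maps to the system class, and only an explicit reserve request reaches the interrupt class.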


242941 13-Nov-2012 alc

Replace the single, global page queues lock with per-queue locks on the
active and inactive paging queues.

Reviewed by: kib


242903 12-Nov-2012 attilio

Fix DDB command "show map XXX":
- Check that an argument is always available; otherwise the current map
printed before recursing is garbage.
- Spit out a message if an argument is not provided.
- Remove unread nlines variable.
- Use an explicit recursive function, disassociated from the
DB_SHOW_COMMAND() body, in order to make clear the prototype and recursion
of the above mentioned function. The resulting code is much less
obscure.

Submitted by: gianni


242476 02-Nov-2012 kib

r241025 fixed the case where a binary executed from a nullfs mount
could still be opened for write from the lower filesystem. There
is a symmetric situation where the binary could already have file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by: pho
MFC after: 3 weeks


242434 01-Nov-2012 alc

In general, we call pmap_remove_all() before calling vm_page_cache(). So,
the call to pmap_remove_all() within vm_page_cache() is usually redundant.
This change eliminates that call to pmap_remove_all() and introduces a
call to pmap_remove_all() before vm_page_cache() in the one place where
it didn't already exist.

When iterating over a paging queue, if the object containing the current
page has a zero reference count, then the page can't have any managed
mappings. So, a call to pmap_remove_all() is pointless.

Change a panic() call in vm_page_cache() to a KASSERT().

MFC after: 6 weeks


242402 31-Oct-2012 attilio

Rework the known mutexes to stay on their own cache line by using
struct mtx_padalign, instead of manual frobbing.

The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by: jimharris, alc


242300 29-Oct-2012 alc

Replace the page hold queue, PQ_HOLD, by a new page flag, PG_UNHOLDFREE,
because the queue itself serves no purpose. When a held page is freed,
inserting the page into the hold queue has the side effect of setting the
page's "queue" field to PQ_HOLD. Later, when the page is unheld, it will
be freed because the "queue" field is PQ_HOLD. In other words, PQ_HOLD is
used as a flag, not a queue. So, this change replaces it with a flag.

To accommodate the new page flag, make the page's "flags" field wider and
"oflags" field narrower.

Reviewed by: kib
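
The deferred-free behaviour that PG_UNHOLDFREE encodes can be sketched in userland like this. The structure, the bit value, and both functions are hypothetical stand-ins; the sketch only shows a flag doing the job that the queue field used to do.

```c
#include <stdbool.h>

#define SK_PG_UNHOLDFREE	0x1	/* placeholder bit value */

struct sk_page {
	int hold_count;
	int flags;
	bool freed;
};

/* Freeing a held page is deferred: mark it instead of freeing it. */
static void
sk_page_free(struct sk_page *m)
{
	if (m->hold_count > 0) {
		m->flags |= SK_PG_UNHOLDFREE;
		return;
	}
	m->freed = true;
}

/* Dropping the last hold completes a deferred free, if one is pending. */
static void
sk_page_unhold(struct sk_page *m)
{
	m->hold_count--;
	if (m->hold_count == 0 && (m->flags & SK_PG_UNHOLDFREE) != 0) {
		m->flags &= ~SK_PG_UNHOLDFREE;
		m->freed = true;
	}
}
```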


242268 28-Oct-2012 trasz

Remove useless check; vm_pindex_t is unsigned on all architectures.

CID: 3701
Found with: Coverity Prevent


242152 26-Oct-2012 mdf

Const-ify the zone name argument to uma_zcreate(9).

MFC after: 3 days


242151 26-Oct-2012 andre

Move the corresponding MTX_SYSINIT() next to their struct mtx declaration
to make their relationship more obvious as done with the other such mutexes.


242012 24-Oct-2012 kib

Commit the actual text provided by Alan, instead of the wrong update
in r242011.

MFC after: 1 week


242011 24-Oct-2012 kib

Dirty the newly copied anonymous pages after the wired region is
forked. Otherwise, pagedaemon might reclaim the page without saving
its content into the swap file, resulting in the valid content
replaced by zeroes.

Reported and tested by: pho
Reviewed and comment update by: alc
MFC after: 1 week


241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


241825 22-Oct-2012 eadler

Print flags as hex instead of an integer.

PR: kern/168210
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


241517 13-Oct-2012 alc

Move vm_page_requeue() to the only file that uses it.

MFC after: 3 weeks


241512 13-Oct-2012 alc

Eliminate the conditional for releasing the page queues lock in
vm_page_sleep(). vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held. These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after: 3 weeks


241155 03-Oct-2012 alc

Tidy up a bit:

Update some of the comments. In particular, use "sleep" in preference to
"block" where appropriate.

Eliminate some unnecessary casts.

Make a few whitespace changes for consistency.

Reviewed by: kib
MFC after: 3 days


241025 28-Sep-2012 kib

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


240862 23-Sep-2012 alc

Address a race condition that was introduced in r238212. Unless the page
queues lock is acquired before the page lock is released, there is no
guarantee that the page will still be in that same page queue when
vm_page_requeue() is called.

Reported by: pho
In collaboration with: kib
MFC after: 3 days


240741 20-Sep-2012 kib

Plug the accounting leak for the wired pages when msync(MS_INVALIDATE)
is performed on the vnode mapping which is wired in other address space.

While there, explicitly assert that the page is unwired and zero the
wire_count instead of subtracting. The condition is rechecked later in
vm_page_free(_toq) already.

Reported and tested by: zont
Reviewed by: alc (previous version)
MFC after: 1 week


240676 18-Sep-2012 glebius

If caller specifies UMA_ZONE_OFFPAGE explicitly, then do not waste memory
in an allocation for a slab.

Reviewed by: jeff


240518 14-Sep-2012 eadler

Correct double "the the"

Approved by: cperciva
MFC after: 3 days


240145 05-Sep-2012 zont

- Simplify VM code by using vmspace_wired_count() for counting wired
memory of a process.

Reviewed by: avg
Approved by: kib (mentor)
MFC after: 2 weeks


240134 05-Sep-2012 des

Whitespace cleanup.


240113 04-Sep-2012 des

No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.

(for real this time)


240105 04-Sep-2012 des

Revert previous commit, which was performed in the wrong tree.


240096 04-Sep-2012 des

No memory barrier is required. This was pointed out by kib@ a while ago,
but I got distracted by other matters.


240069 03-Sep-2012 zont

- After r240026 sgrowsiz should be used in a safer manner.

Approved by: kib (mentor)
MFC after: 1 week


239895 30-Aug-2012 zont

- Remove accounting of locked memory from vsunlock(9) that I missed in r239818.

Approved by: kib (mentor)


239818 29-Aug-2012 zont

- Don't account locked memory to the current process in vslock(9).

There are two consumers of vslock(9): sysctl code and drm driver. These
consumers use locked memory as transient memory; it doesn't belong
to a process's memory.

Suggested by: avg
Reviewed by: alc
Approved by: kib (mentor)
MFC after: 2 weeks


239723 27-Aug-2012 pluknet

Typo in previous change: print half the theoretical maximum as maximum
recommended amount.

Reported by: <site freebsd at orientalsensation com>
Reviewed by: des


239710 26-Aug-2012 glebius

Fix function name in keg_cachespread_init() assert.


239327 16-Aug-2012 des

- When running out of swzone, instead of spewing an error message every
tick until the situation is resolved (if ever), just print a single
message when running out and another when space becomes available.

- When adding more swap, warn if the total amount exceeds half the
theoretical maximum we can handle.
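
The first item above replaces a per-tick message with one message per state change. A minimal userland sketch of that edge-triggered reporting follows; the structure and function are hypothetical, and the return value stands in for the two console messages.

```c
#include <stdbool.h>

struct sk_swzone_state {
	bool exhausted;		/* state as of the last report */
};

/*
 * Returns 1 on the transition to exhausted, -1 on recovery, and 0 when
 * the state is unchanged and nothing should be printed.
 */
static int
sk_swzone_report(struct sk_swzone_state *s, bool alloc_failed)
{
	if (alloc_failed && !s->exhausted) {
		s->exhausted = true;
		return (1);
	}
	if (!alloc_failed && s->exhausted) {
		s->exhausted = false;
		return (-1);
	}
	return (0);
}
```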


239250 14-Aug-2012 kib

For the old mmap syscall, when executing on amd64 or ia64, enforce
PROT_EXEC if prot is non-zero, the process is 32-bit and the
kern.elf32.i386_read_exec sysctl is enabled. This workaround is needed
for old i386 a.out binaries, where the dynamic linker did not specify
PROT_EXEC for the mapping of the text.

The kern.elf32.i386_read_exec MIB name looks weird for a.out binaries,
but I reused the existing knob which already has the needed semantic.

MFC after: 1 week
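
The protection adjustment described above can be sketched as a pure function. This is an illustration only: sk_ommap_adjust_prot() and its two boolean parameters are hypothetical stand-ins for the process and knob checks, and the amd64/ia64 restriction is assumed to be decided at build time, outside the function.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Add PROT_EXEC to any non-empty protection for 32-bit processes. */
static int
sk_ommap_adjust_prot(int prot, bool is_32bit, bool read_exec_enabled)
{
	if (prot != 0 && is_32bit && read_exec_enabled)
		prot |= PROT_EXEC;
	return (prot);
}
```

A PROT_NONE request is left alone, since the rule applies only when prot is non-zero.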


239247 14-Aug-2012 kib

Adjust the r205536, by allowing a non-zero offset for anonymous
mappings for a.out binaries. Apparently, a.out ld.so from FreeBSD
1.1.5.1 can issue such requests.

Reported and tested by: Dan Plassche <dplassche@gmail.com>
MFC after: 1 week


239246 14-Aug-2012 kib

Do not leave invalid pages in the object after a short read on
network file systems (not only NFS proper). Short reads cause pages
other than the requested one, which were not filled by the read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As a result, invalid pages that were not
requested are freed even if the read RPC indicated success.

Noted and reviewed by: alc
MFC after: 1 week


239121 07-Aug-2012 alc

Never sleep on busy pages in vm_pageout_launder(), always skip them. Long
ago, sleeping on busy pages in vm_pageout_launder() made sense. The call
to vm_pageout_flush() specified asynchronous I/O and sleeping on busy pages
blocked vm_pageout_launder() until the flush had completed. However, in
CVS revision 1.35 of vm/vm_contig.c, the call to vm_pageout_flush() was
changed to request synchronous I/O, but the sleep on busy pages was not
removed.


239065 05-Aug-2012 kib

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. The other big dependency of vm_page.h on
vm_param.h is the PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitly for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


239040 04-Aug-2012 kib

Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other than the requested one, for VOP_GETPAGES().

Reviewed by: alc
MFC after: 1 week


238998 03-Aug-2012 alc

Inline vm_page_aflags_clear() and vm_page_aflags_set().

Add comments stating that neither these functions nor the flags that they
are used to manipulate are part of the KBI.


238915 30-Jul-2012 alc

Eliminate an unneeded declaration. (I should have removed this as part
of r227568.)


238791 26-Jul-2012 kib

Do not requeue held page or page for which locking failed, just leave
them alone.

Process the act_count updates for the held pages in the vm_pageout
loop over the inactive queue, instead of refusing to do anything with
such page.

Clarify the intent of the addl_page_shortage counter and change its
use for pages which are not processed in the loop according to the
description.

Reviewed by: alc
MFC after: 2 weeks


238732 24-Jul-2012 alc

Addendum to r238604. If the inactive queue scan isn't restarted, then
the variable "addl_page_shortage_init" isn't needed.

X-MFC after: r238604


238604 18-Jul-2012 kib

Do not restart scan of the inactive queue when non-inactive page is
found. Rather, we shall not find such pages on inactive queue at all.

Requested and reviewed by: alc
MFC after: 2 weeks


238561 18-Jul-2012 alc

Move what remains of vm/vm_contig.c into vm/vm_pageout.c, where similar
code resides. Rename vm_contig_grow_cache() to vm_pageout_grow_cache().

Reviewed by: kib


238543 17-Jul-2012 alc

Correct vm_page_alloc_contig()'s implementation of VM_ALLOC_NODUMP.


238536 16-Jul-2012 alc

Various improvements to vm_contig_grow_cache(). Most notably, even when
it can't sleep, it can still move clean pages from the inactive queue to
the cache. Also, when a page is cached, there is no need to restart the
scan. The "next" page pointer held by vm_contig_launder() is still
valid. Finally, add a comment summarizing what vm_contig_grow_cache()
does based upon the value of "tries".

MFC after: 3 weeks


238510 15-Jul-2012 alc

Correct an off-by-one error in vm_reserv_alloc_contig() that resulted in
the last reservation of a multi-reservation allocation not being
initialized.


238502 15-Jul-2012 mdf

Fix a bug with memguard(9) on 32-bit architectures without a
VM_KMEM_MAX_SIZE.

The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.

Found by: Justin Hibbits


238456 14-Jul-2012 alc

If vm_contig_grow_cache() is allowed to sleep, then invoke the vm_lowmem
handlers.


238452 14-Jul-2012 alc

Move kmem_alloc_{attr,contig}() to vm/vm_kern.c, where similarly named
functions reside. Correct the comment describing kmem_alloc_contig().


238359 11-Jul-2012 attilio

Document the object type movements, related to swp_pager_copy(),
in vm_object_collapse() and vm_object_split().

In collaboration with: alc
MFC after: 3 days


238258 08-Jul-2012 kib

Avoid vm page queues lock leak after r238212.

Reported and tested by: Michael Butler <imb protected-networks net>
Reviewed by: alc
Pointy hat to: kib
MFC after: 20 days


238212 07-Jul-2012 kib

Drop page queues mutex on each iteration of vm_pageout_scan over the
inactive queue, unless busy page is found.

Dropping the mutex often should allow the other lock acquires to
proceed without waiting for the whole inactive scan to finish. On machines
with a lot of physical memory the scan often needs to iterate a lot before
it finishes or finds a page which requires laundering, causing high
latency for other lock waiters.

Suggested and reviewed by: alc
MFC after: 3 weeks


238206 07-Jul-2012 eadler

Add missing sleep stat increase

PR: kern/168211
Submitted by: linimon
Reviewed by: alc
Approved by: cperciva
MFC after: 3 days


238180 06-Jul-2012 kib

Style.

Reviewed by: alc (previous version)
MFC after: 1 week


238000 02-Jul-2012 jhb

Honor db_pager_quit in 'show uma' and 'show malloc'.

MFC after: 1 month


237623 27-Jun-2012 alc

Add new pmap layer locks to the predefined lock order. Change the names
of a few existing VM locks to follow a consistent naming scheme.


237451 22-Jun-2012 attilio

- Add a comment explaining the locking of the cached pages pool held
by vm_objects.
- Add flags for the per-object lock and free pages queue mutex lock.
Use the newly added flags to mark the cache root within the vm_object
structure.

Please note that other vm_object members should be marked with correct
locking but they are left for other commits.

In collaboration with: alc

MFC after: 3 days


237346 20-Jun-2012 alc

Selectively inline vm_page_dirty().


237334 20-Jun-2012 jhb

Move the per-thread deferred user map entries list into a private list
in vm_map_process_deferred() which is then iterated to release map entries.
This avoids having a nested vm map unlock operation called from the loop
body attempt to recurse into vm_map_process_deferred(). This can happen if
the vm_map_remove() triggers the OOM killer.

Reviewed by: alc, kib
MFC after: 1 week
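
The private-list pattern described above can be sketched in plain C.
This is only an illustration, not the kernel code: struct deferred_entry,
deferred_head, release_entry() and process_deferred() are hypothetical
stand-ins. The release step deliberately re-enters the processing
function to show that a nested call finds the shared list empty.

```c
#include <stddef.h>

/* Hypothetical stand-in for a deferred map entry. */
struct deferred_entry {
	struct deferred_entry *next;
	int released;
};

/* Hypothetical per-thread list of entries awaiting release. */
static struct deferred_entry *deferred_head;
static int release_calls;

static void process_deferred(void);

/* Releasing an entry may itself re-enter process_deferred(). */
static void
release_entry(struct deferred_entry *e)
{
	e->released = 1;
	release_calls++;
	process_deferred();	/* nested call sees an empty shared list */
}

static void
process_deferred(void)
{
	struct deferred_entry *e, *next;

	/* Detach the whole list first so nested calls find nothing to do. */
	e = deferred_head;
	deferred_head = NULL;
	for (; e != NULL; e = next) {
		next = e->next;	/* read the link before releasing the entry */
		release_entry(e);
	}
}
```

Because the shared head is cleared before the loop starts, each entry is
released exactly once no matter how often the loop body re-enters.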


237172 16-Jun-2012 attilio

Do a more targeted check on the page cache and avoid checking the cache
pointer directly in vnode_pager_setsize() by using the newly introduced
vm_page_is_cached() function.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039,234064


237168 16-Jun-2012 alc

The page flag PGA_WRITEABLE is set and cleared exclusively by the pmap
layer, but it is read directly by the MI VM layer. This change introduces
pmap_page_is_write_mapped() in order to completely encapsulate all direct
access to PGA_WRITEABLE in the pmap layer.

Aesthetics aside, I am making this change because amd64 will likely begin
using an alternative method to track write mappings, and having
pmap_page_is_write_mapped() in place allows me to make such a change
without further modification to the MI VM layer.

As an added bonus, tidy up some nearby comments concerning page flags.

Reviewed by: kib
MFC after: 6 weeks


236848 10-Jun-2012 kib

Use the previous stack entry protection and max protection to correctly
propagate the stack execution permissions when stack is grown down.

First, curproc->p_sysent->sv_stackprot specifies maximum allowed stack
protection for current ABI, so the new stack entry was typically marked
executable always. Second, for non-main stack MAP_STACK mapping,
the PROT_ flags should be used which were specified at the mmap(2) call
time, and not sv_stackprot.

MFC after: 1 week


236417 01-Jun-2012 eadler

Revert r236380

PR: kern/166780
Requested by: many
Approved by: cperciva (implicit)


236380 01-Jun-2012 eadler

Add sysctl to query amount of swap space free

PR: kern/166780
Submitted by: Radim Kolar <hsn@sendmail.cz>
Approved by: cperciva
MFC after: 1 week


235854 23-May-2012 emax

Tweak condition for disabling allocation from per-CPU buckets in
low memory situation. I've observed a situation where per-CPU
allocations were disabled while there were enough free cached pages.
Basically, cnt.v_free_count was sitting stable at a value lower
than cnt.v_free_min and that caused massive performance drop.

Reviewed by: alc
MFC after: 1 week


235850 23-May-2012 kib

Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by: Andrey Zonov <andrey zonov org>
MFC after: 1 week


235829 23-May-2012 avg

vm_pager_object_lookup: small performance optimization

do not needlessly lock an object if its handle doesn't match

Reviewed by: kib, alc
MFC after: 1 week


235776 22-May-2012 andrew

Fix booting on ARM.

In PHYS_TO_VM_PAGE() when VM_PHYSSEG_DENSE is set the check if we are past
the end of vm_page_array was incorrect causing it to return NULL. This
value is then used in vm_phys_add_page causing a data abort.

Reviewed by: alc, kib, imp
Tested by: stas


235689 20-May-2012 nwhitehorn

Replace the list of PVOs owned by each PMAP with an RB tree. This simplifies
range operations like pmap_remove() and pmap_protect() as well as allowing
simple operations like pmap_extract() not to involve any global state.
This substantially reduces lock coverages for the global table lock and
improves concurrency.


235603 18-May-2012 kib

Do not double-reference the found vm object in cdev_pager_lookup().
vm_pager_object_lookup() already referenced the object.

Note that there are no in-tree consumers of cdev_pager_lookup(). The
only known user of the function is i915 gem driver, which is not yet
imported. This should make the KPI change minor.

Submitted by: avg
MFC after: 1 week


235375 12-May-2012 kib

Add new pager type, OBJT_MGTDEVICE. It provides the device pager
which carries fictitious managed pages. In particular, the consumers of
the new object type can remove all mappings of the device page with
pmap_remove_all().

The range of physical addresses used for fake page allocation shall be
registered with vm_phys_fictitious_reg_range() interface to allow the
PHYS_TO_VM_PAGE() to work in pmap.

Most likely, only i386 and amd64 pmaps can handle fictitious managed
pages right now.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235372 12-May-2012 kib

Add a facility to register a range of physical addresses to be used
for allocation of fictitious pages, for which PHYS_TO_VM_PAGE()
returns proper fictitious vm_page_t. The range should be de-registered
after consumer stopped using it.

De-inline the PHYS_TO_VM_PAGE() since it now carries code to iterate
over registered ranges.

A hash container might be developed instead of the range registration
interface, and fake pages could be put automatically into the hash,
where PHYS_TO_VM_PAGE() could look them up later. This should be
considered before the MFC of the commit is done.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235366 12-May-2012 kib

Split the code from vm_page_getfake() to initialize the fake page struct
vm_page into new interface vm_page_initfake(). Handle the case of fake
page re-initialization with changed memattr.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235365 12-May-2012 kib

Assert that the page passed to vm_page_putfake() is unmanaged.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235362 12-May-2012 kib

Assert that fictitious or unmanaged pages do not appear on
active/inactive lists.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235359 12-May-2012 kib

Commit the change forgotten in r235356.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235356 12-May-2012 kib

Make the vm_page_array_size long. Remove redundant zero initialization
for vm_page_array_size and nearby variables.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


235230 10-May-2012 alc

Give vm_fault()'s sequential access optimization a makeover.

There are two aspects to the sequential access optimization: (1) read ahead
of pages that are expected to be accessed in the near future and (2) unmap
and cache behind of pages that are not expected to be accessed again. This
revision changes both aspects.

The read ahead optimization is now more effective. It starts with the same
initial read window as before, but arithmetically grows the window on
sequential page faults. This can yield increased read bandwidth. For
example, on one of my machines, a program using mmap() to read a file that
is several times larger than the machine's physical memory takes about 17%
less time to complete.

The unmap and cache behind optimization is now more selectively applied.
The read ahead window must grow to its maximum size before unmap and cache
behind is performed. This significantly reduces the number of times that
pages are unmapped and cached only to be reactivated a short time later.

The unmap and cache behind optimization now clears each page's referenced
flag. Previously, in the case of dirty pages, if the containing file was
still mapped at the time that the page daemon examined the dirty pages,
they would be reactivated.

From a stylistic standpoint, this revision also cleanly separates the
implementation of the read ahead and unmap/cache behind optimizations.

Glanced at: kib
MFC after: 2 weeks
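
The window policy described in r235230 can be sketched roughly as
follows. The constants RA_INIT, RA_STEP and RA_MAX, the struct ra_state
and the function ra_fault() are hypothetical names and values chosen for
illustration; resetting the window on a non-sequential fault is also an
assumption, since the message does not say what happens in that case.

```c
#include <stdbool.h>

#define RA_INIT	8	/* hypothetical initial window, in pages */
#define RA_STEP	8	/* hypothetical growth per sequential fault */
#define RA_MAX	32	/* hypothetical maximum window, in pages */

struct ra_state {
	unsigned window;	/* current read-ahead window, in pages */
};

/*
 * Update the window for one fault.  Returns true when the window has
 * reached its maximum, i.e. when unmap and cache behind may be performed.
 */
static bool
ra_fault(struct ra_state *s, bool sequential)
{
	if (!sequential) {
		/* Assumption: a random access resets the window. */
		s->window = RA_INIT;
		return (false);
	}
	/* Arithmetic (not geometric) growth, clamped to the maximum. */
	if (s->window + RA_STEP <= RA_MAX)
		s->window += RA_STEP;
	else
		s->window = RA_MAX;
	return (s->window == RA_MAX);
}
```

With these values, cache behind is first allowed on the third
consecutive sequential fault, which is the kind of delay the message
credits with reducing needless unmap/reactivate cycles.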


234576 22-Apr-2012 nwhitehorn

Avoid a lock order reversal in pmap_extract_and_hold() from relocking
the page. This PMAP requires an additional lock besides the PMAP lock
in pmap_extract_and_hold(), which vm_page_pa_tryrelock() did not release.

Suggested by: kib
MFC after: 4 days


234556 21-Apr-2012 kib

When MAP_STACK mapping is created, the map entry is created only to
cover the initial stack size. For MCL_WIREFUTURE maps, the subsequent
call to vm_map_wire() to wire the whole stack region fails due to
VM_MAP_WIRE_NOHOLES flag.

Use the VM_MAP_WIRE_HOLESOK to only wire mapped part of the stack.

Reported and tested by: Sushanth Rai <sushanth_rai yahoo com>
Reviewed by: alc
MFC after: 1 week


234554 21-Apr-2012 alc

As documented in vm_page.h, updates to the vm_page's flags no longer
require the page queues lock.

MFC after: 1 week


234064 09-Apr-2012 attilio

- Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039


234039 08-Apr-2012 alc

Fix mincore(2) so that it reports PG_CACHED pages as resident.

MFC after: 2 weeks


234038 08-Apr-2012 alc

If a page belonging to a reservation is cached, then mark the reservation so
that it will be freed to the cache pool rather than the default pool.
Otherwise, the cached pages within the reservation may be recycled sooner
than necessary.

Reported by: Andrey Zonov


233960 06-Apr-2012 attilio

Staticize vm_page_cache_remove().

Reviewed by: alc


233949 06-Apr-2012 nwhitehorn

Reduce the frequency that the PowerPC/AIM pmaps invalidate instruction
caches, by invalidating kernel icaches only when needed and not flushing
user caches for shared pages.

Suggested by: kib
MFC after: 2 weeks


233925 05-Apr-2012 jhb

Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take. The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by: kib
MFC after: 2 weeks


233627 28-Mar-2012 mckusick

Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib


233291 22-Mar-2012 alc

Handle spurious page faults that may occur in no-fault sections of the
kernel.

When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.

In collaboration with: kib
Tested by: gibbs (an earlier version)

I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.

MFC after: 1 week


233194 19-Mar-2012 jhb

Bah, just revert my earlier change entirely. (Missed alc's request to do
this earlier.)

Requested by: alc


233191 19-Mar-2012 jhb

Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB. Specifically, the inlined version of 'ptoa' of the 'int'
count of pages overflowed on 64-bit platforms. While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs as suggested
by alc@.
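
The overflow described in r233191 can be reproduced in a small,
self-contained sketch. The names ptoa_narrow() and ptoa_wide() and the
4 KB page size are assumptions for illustration. The sketch uses
unsigned 32-bit arithmetic so that the wrap-around is well defined in C;
the original code used a signed 'int', where the overflow is undefined
behaviour.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH	12	/* assumes 4 KB pages */

/* Narrow conversion: the result is truncated to 32 bits and wraps at 4 GB. */
static uint64_t
ptoa_narrow(uint32_t npages)
{
	return ((uint32_t)(npages << PAGE_SHIFT_SKETCH));
}

/* Wide conversion: widen to 64 bits first, then shift. */
static uint64_t
ptoa_wide(uint32_t npages)
{
	return ((uint64_t)npages << PAGE_SHIFT_SKETCH);
}
```

For a mapping of 0x100000 pages (exactly 4 GB), the narrow version
yields 0 while the wide version yields 0x100000000.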


233190 19-Mar-2012 jhb

Alter the previous commit to use vm_size_t instead of vm_pindex_t.
vm_pindex_t is not a count of pages per se, it is more like vm_ooffset_t,
but a page index instead of a byte offset.


233100 17-Mar-2012 kib

In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag
if the filesystem performed short write and we are skipping the page
due to this.

Propagate write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as the FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in the
vm_object_page_clean() and arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


232984 14-Mar-2012 jhb

Pedantic nit: use vm_pindex_t instead of long for a count of pages.


232701 08-Mar-2012 jhb

Add KTR_VFS traces to track modifications to a vnode's writecount.


232399 02-Mar-2012 alc

Eliminate stale incorrect ARGSUSED comments.

Submitted by: bde


232288 29-Feb-2012 alc

Simplify kmem_alloc() by eliminating code that existed on account of
external pagers in Mach. FreeBSD doesn't implement external pagers.
Moreover, it doesn't page out the kernel object. So, the reasons for
having the code don't hold.

Reviewed by: kib
MFC after: 6 weeks


232166 25-Feb-2012 alc

Simplify vm_mmap()'s control flow.

Add a comment describing what vm_mmap_to_errno() does.

Reviewed by: kib
MFC after: 3 weeks
X-MFC after: r232071


232160 25-Feb-2012 alc

Simplify vmspace_fork()'s control flow by copying immutable data before
the vm map locks are acquired. Also, eliminate redundant initialization
of the new vm map's timestamp.

Reviewed by: kib
MFC after: 3 weeks


232103 24-Feb-2012 kib

Place the if() at the right location, to activate the v_writecount
accounting for shared writeable mappings for all filesystems, not only
for the bypass layers.

Submitted by: alc
Pointy hat to: kib
MFC after: 20 days


232071 23-Feb-2012 kib

Account the writeable shared mappings backed by file in the vnode
v_writecount. Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry. During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBSY, as it
should be.

Noted by: tegge
Reviewed by: tegge (long time ago, early version), alc
Tested by: pho
MFC after: 3 weeks
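
The accounting rule in r232071 (bump v_writecount only on the zero to
non-zero transition of writemappings, and drop it only on the return to
zero) can be sketched as below. The struct and function names are
hypothetical, and locking is omitted entirely.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the vnode and vm object. */
struct vnode_sketch {
	int v_writecount;
};

struct object_sketch {
	struct vnode_sketch *vp;
	size_t writemappings;	/* bytes of writeable shared mappings */
};

/* Account a new writeable shared mapping of len bytes. */
static void
writemappings_add(struct object_sketch *obj, size_t len)
{
	if (len == 0)
		return;
	if (obj->writemappings == 0)
		obj->vp->v_writecount++;	/* first writeable mapping */
	obj->writemappings += len;
}

/* Release len bytes of writeable shared mapping. */
static void
writemappings_sub(struct object_sketch *obj, size_t len)
{
	if (len == 0 || len > obj->writemappings)
		return;		/* ignore inconsistent requests in this sketch */
	obj->writemappings -= len;
	if (obj->writemappings == 0)
		obj->vp->v_writecount--;	/* last writeable mapping gone */
}
```

Two mappings of the same file therefore contribute a single
v_writecount reference, which is released only when both are gone.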


232002 22-Feb-2012 kib

Remove wrong comment.

Discussed with: alc
MFC after: 3 days


231819 16-Feb-2012 alc

When vm_mmap() is used to map a vm object into a kernel vm_map, it
makes no sense to check the size of the kernel vm_map against the
user-level resource limits for the calling process.

Reviewed by: kib


231526 11-Feb-2012 kib

Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.
Other thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.

Noted and reviewed by: alc
MFC after: 1 week


231378 10-Feb-2012 ed

Remove direct access to si_name.

Code should just use the devtoname() function to obtain the name of a
character device. Also add const keywords to pieces of code that need it
to build properly.

MFC after: 2 weeks


230877 01-Feb-2012 mav

Fix NULL dereference panic on attempt to turn off (on system shutdown)
disconnected swap device.

This is a quick and imperfect solution, as the swap device will still be
opened and GEOM will not be able to destroy it. The proper solution would be
to automatically turn off and close the disconnected swap device, but with the
existing code it will cause a panic if there is at least one page on the
device, even if it is an unimportant page of a user-level process. It needs
some work.

Reviewed by: kib@
MFC after: 1 week


230623 27-Jan-2012 kmacy

Exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64.
Excluding other allocations, including UMA, now entails the addition of
a single flag to kmem_alloc or uma zone create.

Reviewed by: alc, avg
MFC after: 2 weeks


230247 17-Jan-2012 nwhitehorn

Revert r212360 now that PowerPC can handle large sparse arguments to
pmap_remove() (changed in r228412).

MFC after: 2 weeks


229934 10-Jan-2012 kib

Change the type of the paging_in_progress refcounter from u_short to
u_int. With the auto-sized buffer cache on the modern machines, UFS
metadata can generate more than 65535 pages belonging to the buffers
undergoing i/o, overflowing the counter.

Reported and tested by: jimharris
Reviewed by: alc
MFC after: 1 week
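
A minimal sketch of the wrap-around fixed by r229934, assuming a 16-bit
u_short and a 32-bit u_int (modelled here with uint16_t and uint32_t).
The helper names bump16() and bump32() are hypothetical.

```c
#include <stdint.h>

/* 16-bit counter: the sum is truncated back to 16 bits on return. */
static uint16_t
bump16(uint16_t count, uint32_t n)
{
	return ((uint16_t)(count + n));
}

/* 32-bit counter: the same increment is represented without wrapping. */
static uint32_t
bump32(uint32_t count, uint32_t n)
{
	return (count + n);
}
```

Adding one reference to a 16-bit counter that already holds 65535 gives
0, which would look like "no paging in progress"; the 32-bit counter
gives 65536.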


229495 04-Jan-2012 kib

Do not restart the scan in vm_object_page_clean() on the object
generation change if requested mode is async. The object generation is
only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async
mode it is enough to write each dirty page, not to make a guarantee that
all pages are cleared after the vm_object_page_clean() returned.

Diagnosed by: truckman
Tested by: flo
Reviewed by: alc, truckman
MFC after: 2 weeks


228936 28-Dec-2011 alc

Optimize vm_object_split()'s handling of reservations.


228838 23-Dec-2011 kib

Optimize the common case of msyncing the whole file mapping with
MS_SYNC flag. The system must guarantee that all writes are finished
before the syscall returns. Schedule the writes in async mode, which is
much faster and allows the clustering to occur. Wait for writes using
VOP_FSYNC(), since we are syncing the whole file mapping.

Potentially, the restriction to only apply the optimization can be
relaxed by not requiring that the mapping cover whole file, as it is
done by other OSes.

Reported and tested by: az
Reviewed by: alc
MFC after: 2 weeks


228567 16-Dec-2011 kib

Move kstack_cache_entry into the private header, and make the
stack cache list header accessible outside vm_glue.c.

MFC after: 1 week


228498 14-Dec-2011 eadler

- The previous commit (r228449) accidentally moved the vm.stats.vm.* sysctls
to vm.stats.sys. Move them back.

Noticed by: pho
Reviewed by: bde (earlier version)
Approved by: bz
MFC after: 1 week
Pointy hat to: me


228449 13-Dec-2011 eadler

Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR: kern/155491
PR: kern/155490
PR: kern/155489
Submitted by: Galimov Albert <wtfcrap@mail.ru>
Approved by: bde
Reviewed by: jhb
MFC after: 1 week


228432 12-Dec-2011 kib

Fix printf.

Submitted by: az
MFC after: 1 week


228287 05-Dec-2011 alc

Introduce vm_reserv_alloc_contig() and teach vm_page_alloc_contig() how to
use superpage reservations. So, for the first time, kernel virtual memory
that is allocated by contigmalloc(), kmem_alloc_attr(), and
kmem_alloc_contig() can be promoted to superpages. In fact, even a series
of small contigmalloc() allocations may collectively result in a promoted
superpage.

Eliminate some duplication of code in vm_reserv_alloc_page().

Change the type of vm_reserv_reclaim_contig()'s first parameter in order
that it be consistent with other vm_*_contig() functions.

Tested by: marius (sparc64)


228156 30-Nov-2011 kib

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


228133 29-Nov-2011 kib

Hide the internals of vm_page_lock(9) from the loadable modules.
Since the address of vm_page lock mutex depends on the kernel options,
it is easy for module to get out of sync with the kernel.

No vm_page_lockptr() accessor is provided for modules. It can be added
later if needed, unless proper KPI is developed to serve the needs.

Reviewed by: attilio, alc
MFC after: 3 weeks


227788 21-Nov-2011 attilio

Introduce the same mutex-wise fix in r227758 for sx locks.

The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_

Now vm_map locking is fully converted and can avoid knowing the specifics
of the locking procedures.

Reviewed by: kib
MFC after: 1 month


227758 20-Nov-2011 attilio

Introduce macro stubs in the mutex implementation that will be always
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.

The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_

Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
ppbus code by using the same macro as done in this patch (but this is
left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
the most notable case is sx which will allow a further cleanup of
vm_map locking facilities
- The patch should be fully compatible with older branches, thus an MFC
is foreseen (in fact it uses all the underlying mechanisms already
present).

Comments review by: eadler, Ben Kaduk
Discussed with: kib, jhb
MFC after: 1 month


227606 17-Nov-2011 alc

Eliminate end-of-line white space.


227568 16-Nov-2011 alc

Refactor the code that performs physically contiguous memory allocation,
yielding a new public interface, vm_page_alloc_contig(). This new function
addresses some of the limitations of the current interfaces, contigmalloc()
and kmem_alloc_contig(). For example, the physically contiguous memory that
is allocated with those interfaces can only be allocated to the kernel vm
object and must be mapped into the kernel virtual address space. It also
provides functionality that vm_phys_alloc_contig() doesn't, such as wiring
the returned pages. Moreover, unlike that function, it respects the low
water marks on the paging queues and wakes up the page daemon when
necessary. That said, at present, this new function can't be applied to all
types of vm objects. However, that restriction will be eliminated in the
coming weeks.

From a design standpoint, this change also addresses an inconsistency
between vm_phys_alloc_contig() and the other vm_phys_alloc*() functions.
Specifically, vm_phys_alloc_contig() manipulated vm_page fields that other
functions in vm/vm_phys.c didn't. Moreover, vm_phys_alloc_contig() knew
about vnodes and reservations. Now, vm_page_alloc_contig() is responsible
for these things.

Reviewed by: kib
Discussed with: jhb


227530 15-Nov-2011 kib

Update the device pager interface, while keeping the compatibility
layer for old KPI and KBI. New interface should be used together with
d_mmap_single cdevsw method.

Device pager can be allocated with the cdev_pager_allocate(9)
function, which takes struct cdev_pager_ops, containing
constructor/destructor and page fault handler methods supplied by
driver.

Constructor and destructor, called at the pager allocation and
deallocation time, allow the driver to handle per-object private data.

The pager handler is called to handle page fault on the vm map entry
backed by the driver pager. Driver shall return either the vm_page_t
which should be mapped, or error code (which does not cause kernel
panic anymore). The page handler interface has a placeholder to
specify the access mode causing the fault, but currently PROT_READ is
always passed there.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
MFC after: 1 month


227529 15-Nov-2011 kib

Remove the condition that is always true.

Submitted by: alc
MFC after: 1 week


227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


227127 06-Nov-2011 alc

Wake up the page daemon in vm_page_alloc_freelist() if it couldn't
allocate the requested page because too few pages are cached or free.

Document the VM_ALLOC_COUNT() option to vm_page_alloc() and
vm_page_alloc_freelist().

Make style changes to vm_page_alloc() and vm_page_alloc_freelist(),
such as using a variable name that more closely corresponds to the
comments.


227103 05-Nov-2011 kib

Remove redundant definitions. The chunk was missed from r227102.

MFC after: 2 weeks


227102 05-Nov-2011 kib

Provide typedefs for the type of bit mask for the page bits.
Use the defined types instead of int when manipulating masks.
Supposedly, it could fix support for 32KB page size in the
machine-independent VM layer.

Reviewed by: alc
MFC after: 2 weeks


227072 04-Nov-2011 alc

Simplify the implementation of the failure case in kmem_alloc_attr().


227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


227012 02-Nov-2011 alc

Add support for VM_ALLOC_WIRED and VM_ALLOC_ZERO to vm_page_alloc_freelist()
and use these new options in the mips pmap.

Wake up the page daemon in vm_page_alloc_freelist() if the number of free
and cached pages becomes too low.

Tidy up vm_page_alloc_init(). In particular, add a comment about an
important restriction on its use.

Tested by: jchandra@


226928 30-Oct-2011 alc

Eliminate vm_phys_bootstrap_alloc(). It was a failed attempt at
eliminating duplicated code in the various pmap implementations.

Micro-optimize vm_phys_free_pages().

Introduce vm_phys_free_contig(). It is a fast routine for freeing an
arbitrary number of physically contiguous pages. In particular, it
doesn't require the number of pages to be a power of two.
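
One way a run of arbitrary length can be handed to a power-of-two
(buddy-style) free list is to split it into aligned power-of-two chunks.
The sketch below shows that decomposition only; it is an assumption
about the general technique, not a description of how
vm_phys_free_contig() is actually written. struct chunk, split_contig()
and MAX_ORDER_SKETCH are hypothetical.

```c
#include <stddef.h>

#define MAX_ORDER_SKETCH	10	/* hypothetical: largest chunk is 2^10 pages */

/* One chunk that would be passed to a power-of-two free routine. */
struct chunk {
	unsigned long start;	/* first page index of the chunk */
	int order;		/* chunk covers 2^order pages */
};

/*
 * Split [start, start + npages) into aligned power-of-two chunks.
 * Returns the number of chunks written to out (at most max_out).
 */
static size_t
split_contig(unsigned long start, unsigned long npages,
    struct chunk *out, size_t max_out)
{
	size_t n = 0;

	while (npages > 0 && n < max_out) {
		int order = 0;

		/* Grow the chunk while it stays aligned and still fits. */
		while (order < MAX_ORDER_SKETCH &&
		    (start & ((2UL << order) - 1)) == 0 &&
		    (2UL << order) <= npages)
			order++;
		out[n].start = start;
		out[n].order = order;
		n++;
		start += 1UL << order;
		npages -= 1UL << order;
	}
	return (n);
}
```

For example, 7 pages starting at page 3 split into chunks of 1, 4 and 2
pages at page indices 3, 4 and 8.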

Use "u_long" instead of "unsigned long".

Bruce Evans (bde@) has convinced me that the "boundary" parameters
to kmem_alloc_contig(), vm_phys_alloc_contig(), and
vm_reserv_reclaim_contig() should be of type "vm_paddr_t" and not
"u_long". Make this change.


226891 28-Oct-2011 alc

Use "u_long" instead of "unsigned long".


226848 27-Oct-2011 alc

Tidy up the comment at the head of vm_page_alloc, and mention that the
returned page has the flag VPO_BUSY set.


226843 27-Oct-2011 alc

Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to
vm_page_alloc(). While I'm here, for the sake of consistency, always
specify the allocation class, such as VM_ALLOC_NORMAL, as the first of
the flags.


226824 27-Oct-2011 alc

contigmalloc(9) and contigfree(9) are now implemented in terms of other
more general VM system interfaces. So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.


226740 25-Oct-2011 alc

Speed up vm_page_cache() and vm_page_remove() by checking for a few
common cases that can be handled in constant time. The insight being
that a page's parent in the vm object's tree is very often its
predecessor or successor in the vm object's ordered memq.

Tested by: jhb
MFC after: 10 days


226642 22-Oct-2011 attilio

VM_NRESERVLEVEL is used in this file but opt_vm is not included
thus the stub switch won't be correctly handled.
Include opt_vm.h.

Submitted by: jeff
MFC after: 3 days


226388 15-Oct-2011 kib

Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by: marcel


226366 14-Oct-2011 jhb

Fix a typo in a comment.


226343 13-Oct-2011 marcel

In sys_obreak() and when compiling for amd64 or ia64, when the process
is ILP32 (i.e. i386) grant execute permissions by default. The JDK 1.4.x
depends on being able to execute from the heap on i386.


226313 12-Oct-2011 glebius

Make memguard(9) capable of guarding uma(9) allocations.


225856 29-Sep-2011 kib

Style nit.

Submitted by: jhb
MFC after: 2 weeks


225843 28-Sep-2011 kib

Fix grammar.

Submitted by: bf
MFC after: 2 weeks


225840 28-Sep-2011 kib

Use the trick of performing the atomic operation on the contained aligned
word to handle the dirty mask updates in vm_page_clear_dirty_mask().
Remove the vm page queue lock around vm_page_dirty() call in vm_fault_hold(),
the sole purpose of which was to protect dirty on architectures which
do not provide short or byte-wide atomics.

Reviewed by: alc, attilio
Tested by: flo (sparc64)
MFC after: 2 weeks


225838 28-Sep-2011 kib

Use the explicitly-sized types for the dirty and valid masks.

Requested by: attilio
Reviewed by: alc
MFC after: 2 weeks


225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


225418 06-Sep-2011 kib

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


225089 22-Aug-2011 kib

Update some comments in swap_pager.c.

Reviewed and most wording by: alc
MFC after: 1 week
Approved by: re (bz)


225076 22-Aug-2011 kib

Apply the limit to avoid the overflows in the radix tree subr_blist.c
after the conversion of the swap device size to the page size units,
not before. That lifts the limit on the usable swap partition size
from 32GB to 256GB, which is less depressing for modern systems.

Submitted by: Alexander V. Chernikov <melifaro ipfw ru>
Reviewed by: alc
Approved by: re (bz)
MFC after: 2 weeks
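
The arithmetic behind the 32GB and 256GB figures can be worked through
with a small sketch. The constants below are assumptions inferred from
those two figures (a cap of 2^26 units, 512-byte device blocks, 4KB
pages); the function names are hypothetical and this is not the
swap_pager.c code.

```c
#include <stdint.h>

#define SK_MAX_UNITS	(UINT64_C(1) << 26)	/* assumed allocator cap */
#define SK_DEV_BSIZE	UINT64_C(512)		/* assumed device block size */
#define SK_PAGE_SIZE	UINT64_C(4096)		/* assumed page size */
#define SK_BLOCKS_PER_PAGE (SK_PAGE_SIZE / SK_DEV_BSIZE)

/* Old order: clamp the device-block count, then convert to pages. */
uint64_t
usable_pages_clamp_first(uint64_t dev_blocks)
{
	if (dev_blocks > SK_MAX_UNITS)
		dev_blocks = SK_MAX_UNITS;
	return (dev_blocks / SK_BLOCKS_PER_PAGE);
}

/* New order: convert to pages, then clamp the page count. */
uint64_t
usable_pages_convert_first(uint64_t dev_blocks)
{
	uint64_t pages = dev_blocks / SK_BLOCKS_PER_PAGE;

	return (pages > SK_MAX_UNITS ? SK_MAX_UNITS : pages);
}
```

Under these assumptions a 64GB device (2^27 blocks) yields only 32GB of
usable swap with the old order and the full 64GB with the new one; the
new ceiling is 2^26 pages, i.e. 256GB.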


224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


224746 09-Aug-2011 kib

- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code can now use just
VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)


224689 07-Aug-2011 alc

Fix an error in kmem_alloc_attr(). Unless "tries" is updated,
kmem_alloc_attr() could get stuck in a loop.

Approved by: re (kib)
MFC after: 3 days


224582 01-Aug-2011 kib

Implement the linprocfs swaps file, providing information about the
configured swap devices in the Linux-compatible format.

Based on the submission by: Robert Millan <rmh debian org>
PR: kern/159281
Reviewed by: bde
Approved by: re (kensmith)
MFC after: 2 weeks


224522 30-Jul-2011 kib

Fix a race in the device pager allocation. If another thread won and
allocated the device pager for the given handle, then the object
fictitious pages list and the object membership in the global object
list still need to be initialized. Otherwise, dev_pager_dealloc() will
traverse uninitialized pointers.

Reported and tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
MFC after: 1 week


223914 10-Jul-2011 kib

Extract the code that translates a VM error into an errno value as an
exported function, vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).

Sponsored by: The FreeBSD Foundation
No objections from: alc
MFC after: 1 week
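
A translation function of this kind might look like the sketch below.
The SK_KERN_* constants, their values, and the particular mapping are
assumptions chosen for illustration; consult the kernel sources for the
actual vm_mmap_to_errno() behaviour.

```c
#include <errno.h>

/* Hypothetical stand-ins for the VM layer's return codes. */
enum sk_vm_status {
	SK_KERN_SUCCESS = 0,
	SK_KERN_INVALID_ADDRESS,
	SK_KERN_PROTECTION_FAILURE,
	SK_KERN_NO_SPACE
};

/* Map a VM status code onto the errno value a mmap(2)-like call returns. */
int
sk_vm_to_errno(int rv)
{
	switch (rv) {
	case SK_KERN_SUCCESS:
		return (0);
	case SK_KERN_INVALID_ADDRESS:
	case SK_KERN_NO_SPACE:
		return (ENOMEM);
	case SK_KERN_PROTECTION_FAILURE:
		return (EACCES);
	default:
		return (EINVAL);
	}
}
```

A driver that maps memory on behalf of a user process could pass the
status of its mapping call through such a helper so that its error
codes agree with mmap(2).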


223913 10-Jul-2011 kib

Style.

MFC after: 3 days


223889 09-Jul-2011 kib

Add a facility to disable processing page faults. When activated,
uiomove generates EFAULT if any accessed address is not mapped, as
opposed to handling the fault.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)


223825 06-Jul-2011 trasz

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


223823 06-Jul-2011 attilio

Handle a race between device_pager and devsw in a more graceful manner:
return an error code rather than panic the kernel.

Sponsored by: Sandvine Incorporated
Reviewed by: kib
Tested by: pho
MFC after: 2 weeks


223729 02-Jul-2011 alc

Initialize marker pages as held rather than fictitious/wired. Marking the
page as held is more useful as a safety precaution in case someone forgets
to check for PG_MARKER.

Reviewed by: kib


223677 29-Jun-2011 alc

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


223464 23-Jun-2011 alc

Revert to using the page queues lock in vm_page_clear_dirty_mask() on
MIPS. (At present, although atomic_clear_char() is defined by atomic.h
on MIPS, it is not actually implemented by support.S.)


223307 19-Jun-2011 alc

Precisely document the synchronization rules for the page's dirty field.
(Saying that the lock on the object that the page belongs to must be held
only represents one aspect of the rules.)

Eliminate the use of the page queues lock for atomically performing read-
modify-write operations on the dirty field when the underlying architecture
supports atomic operations on char and short types.

Document the fact that 32KB pages aren't really supported.

Reviewed by: attilio, kib


222992 11-Jun-2011 kib

Assert that page is VPO_BUSY or page owner object is locked in
vm_page_undirty(). The assert is not precise because the VPO_BUSY owner
is not tracked, so the assertion does not catch the case when VPO_BUSY is
owned by another thread.

Reviewed by: alc


222991 11-Jun-2011 kib

Fix a bug in r222586. Lock the page owner object around the modification
of the m->dirty.

Reported and tested by: nwhitehorn
Reviewed by: alc


222586 01-Jun-2011 kib

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the unwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


222184 22-May-2011 alc

Correct an error in r222163. Unless UMA_MD_SMALL_ALLOC is defined,
startup_alloc() must be used until uma_startup2() is called.

Reported by: jh


222163 21-May-2011 alc

1. Prior to r214782, UMA did not support multipage allocations before
uma_startup2() was called. Thus, setting the variable "booted" to true in
uma_startup() was ok on machines with UMA_MD_SMALL_ALLOC defined, because
any allocations made after uma_startup() but before uma_startup2() could be
satisfied by uma_small_alloc(). Now, however, some multipage allocations
are necessary before uma_startup2() just to allocate zone structures on
machines with a large number of processors. Thus, a Boolean can no longer
effectively describe the state of the UMA allocator. Instead, make "booted"
have three values to describe how far initialization has progressed. This
allows multipage allocations to continue using startup_alloc() until
uma_startup2(), but single-page allocations may begin using
uma_small_alloc() after uma_startup().

2. With the aforementioned change, only a modest increase in boot pages is
necessary to boot UMA on a large number of processors.

3. Retire UMA_MD_SMALL_ALLOC_NEEDS_VM. It has only been used between
r182028 and r204128.

Reviewed by: attilio [1], nwhitehorn [3]
Tested by: sbruno
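
The three-valued "booted" state described in item 1 can be sketched as a
small decision function. All identifiers below are hypothetical; only
the decision logic follows the description above, and the real UMA code
is organised differently.

```c
#include <stdbool.h>

/* Hypothetical names for how far allocator initialization has got. */
enum boot_stage {
	STAGE_EARLY,		/* before uma_startup() */
	STAGE_STARTUP_DONE,	/* after uma_startup(), before uma_startup2() */
	STAGE_STARTUP2_DONE	/* after uma_startup2() */
};

enum backend {
	BACKEND_STARTUP_ALLOC,	/* boot-page allocator */
	BACKEND_SMALL_ALLOC,	/* machine-dependent single-page allocator */
	BACKEND_NORMAL		/* regular page allocation */
};

/* Choose which allocator may serve a request of npages pages. */
enum backend
pick_backend(enum boot_stage stage, int npages, bool md_small_alloc)
{
	if (stage == STAGE_STARTUP2_DONE)
		return (BACKEND_NORMAL);
	/* Single pages may leave the boot allocator early on some machines. */
	if (stage == STAGE_STARTUP_DONE && npages == 1 && md_small_alloc)
		return (BACKEND_SMALL_ALLOC);
	/* Multipage requests stay on the boot allocator until stage three. */
	return (BACKEND_STARTUP_ALLOC);
}
```

A Boolean cannot express the middle stage, which is why single-page and
multipage requests could not be treated differently before this change.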


222137 20-May-2011 alc

Fix spelling errors.


222132 20-May-2011 alc

Eliminate a redundant #include. ("vm/vm_param.h" already includes
"machine/vmparam.h".)


221855 13-May-2011 mdf

Move the ZERO_REGION_SIZE to a machine-dependent file, as on many
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.

Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).

Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.

Requested by: alc
MFC after: 1 week
MFC with: r221853


221853 13-May-2011 mdf

Use a globally visible region of zeros for both /dev/zero and the md
device. There are likely other kernel uses of "blob of zeros" that can
be converted.

Reviewed by: alc
MFC after: 1 week


221714 09-May-2011 mlaier

Another long standing vm bug found at Isilon:
Fix a race between vm_object_collapse and vm_fault.

Reviewed by: alc@
MFC after: 3 days


221096 26-Apr-2011 obrien

Reap old SPL comments.

Reviewed by: alc


220977 23-Apr-2011 kib

Fix two bugs in r218670.

Hold the vnode around the region where object lock is dropped, until
vnode lock is acquired.

Do not drop the vnode reference for a case when the object was
deallocated during unlock. Note that in this case, VV_TEXT is cleared
by vnode_pager_dealloc().

Reported and tested by: pho
Reviewed by: alc
MFC after: 3 days


220390 06-Apr-2011 jhb

Fix several places to ignore processes that are not yet fully constructed.

MFC after: 1 week


220387 06-Apr-2011 trasz

In vm_daemon(), do not skip processes stopped with SIGSTOP.


220386 06-Apr-2011 trasz

Add RACCT_RSS.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


220373 05-Apr-2011 trasz

Add accounting for most of the memory-related resources.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


220001 25-Mar-2011 kib

Handle the corner case in vm_fault_quick_hold_pages().

If supplied length is zero, and user address is invalid, function
might return -1, due to the truncation and rounding of the address.
The callers interpret the situation as EFAULT. Instead of handling
the zero length in caller, filter it in vm_fault_quick_hold_pages().

Sponsored by: The FreeBSD Foundation
Reviewed by: alc
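
The corner case can be illustrated with a simplified page-count
computation. The function name, the page size, and the map_max
parameter are assumptions for this sketch; the real function also holds
the pages, which is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	((uintptr_t)4096)	/* assumed page size */

/*
 * Return the number of pages spanned by [addr, addr + len), or -1 if
 * the range is not valid. A zero length is filtered first, so it never
 * reaches the address validation below.
 */
long
pages_to_hold(uintptr_t addr, size_t len, uintptr_t map_max)
{
	uintptr_t start, end;

	if (len == 0)
		return (0);
	end = (addr + len + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
	start = addr & ~(SK_PAGE_SIZE - 1);
	/* Reject wrapped or out-of-range spans. */
	if (start > end || end > map_max)
		return (-1);
	return ((long)((end - start) / SK_PAGE_SIZE));
}
```

With the early return, a caller that passes a zero length together with
an invalid address sees 0 rather than -1 and so does not report EFAULT.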


219968 24-Mar-2011 jhb

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


219819 21-Mar-2011 jeff

- Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
and other miscellaneous small features.


219727 18-Mar-2011 trasz

In vm_daemon(), when iterating over all processes in the system, skip those
which are not yet fully initialized (i.e. ones with p_state == PRS_NEW).
Without it, we could panic in _thread_lock_flags().

Note that there may be other instances of FOREACH_PROC_IN_SYSTEM() that
require similar fix.

Reported by: pho, keramida
Discussed with: kib


219476 11-Mar-2011 alc

Eliminate duplication of the fake page code and zone by the device and sg
pagers.

Reviewed by: jhb


219124 01-Mar-2011 brucec

Change the return type of vmspace_swap_count to a long to match the other
vmspace_*_count functions.

MFC after: 3 days


218989 24-Feb-2011 pluknet

Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.
As pointed out by Alan Cox, it no longer serves its purpose with the
modern UMA allocator, compared to the old one used in the 4.x days.

The removal of the sysctl eliminates the max_proc_mmap type overflow that
led to broken mmap(2) with large amounts of physical memory on arches
with effectively unbounded KVA space (such as amd64). It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.

Reviewed by: alc, kib
Approved by: avg (mentor)
MFC after: 2 months


218966 23-Feb-2011 brucec

Calculate and return the count in vmspace_swap_count as a vm_offset_t
instead of an int to avoid overflow.

While here, clean up some style(9) issues.

PR: kern/152200
Reviewed by: kib
MFC after: 2 weeks


218773 17-Feb-2011 alc

Remove pmap fields that are either unused or not fully implemented.

Discussed with: kib


218701 15-Feb-2011 kib

Since r218070 reenabled the call to vm_map_simplify_entry() from
vm_map_insert(), the kmem_back() assumption about newly inserted
entry might be broken due to interference of two factors. In the low
memory condition, when vm_page_alloc() returns NULL, supplied map is
unlocked. If another thread performs kmem_malloc() meantime, and its
map entry is placed right next to our thread map entry in the map,
both entries' wire counts are still 0 and the entries are coalesced by
vm_map_simplify_entry().

Mark new entry with MAP_ENTRY_IN_TRANSITION to prevent coalesce.
Fix some style issues, tighten the assertions to account for
MAP_ENTRY_IN_TRANSITION state.

Reported and tested by: pho
Reviewed by: alc


218670 13-Feb-2011 kib

Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note
mentioning that vnode lock is needed.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


218592 12-Feb-2011 jmallett

Use CPU_FOREACH rather than expecting CPUs 0 through mp_ncpus-1 to be present.
Don't micro-optimize the uniprocessor case; use the same loop there.

Submitted by: Bhanu Prakash
Reviewed by: kib, jhb


218589 12-Feb-2011 alc

Retire VFS_BIO_DEBUG. Convert those checks that were still valid into
KASSERT()s and eliminate the rest.

Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().

Reviewed by: kib


218345 05-Feb-2011 alc

Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean(). They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by: kib
MFC after: 3 weeks
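
The difference between passing a length and passing an ending offset
can be shown with a simplified range conversion. The names and the
4KB page shift are assumptions; the real vm_object_page_clean() takes
more arguments and has special cases that are not modeled here.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT	12	/* assumed 4KB pages */
#define SK_PAGE_MASK	((UINT64_C(1) << SK_PAGE_SHIFT) - 1)

/* Half-open range [first, end) of page indices. */
struct page_span {
	uint64_t first;
	uint64_t end;
};

/* Convert byte offsets [start, end) to page indices inside the callee. */
struct page_span
span_from_offsets(uint64_t start, uint64_t end)
{
	struct page_span s;

	s.first = start >> SK_PAGE_SHIFT;
	s.end = (end + SK_PAGE_MASK) >> SK_PAGE_SHIFT;
	return (s);
}

/* Number of pages in the span; an inverted span counts as empty. */
uint64_t
span_pages(struct page_span s)
{
	return (s.end > s.first ? s.end - s.first : 0);
}
```

For a 4096-byte commit at offset 8192, span_from_offsets(8192, 8192 +
4096) covers one page, whereas the mistaken call span_from_offsets(8192,
4096) yields an empty span and nothing would be cleaned.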


218304 04-Feb-2011 alc

Since the last parameter to vm_object_shadow() is a vm_size_t and not a
vm_pindex_t, it makes no sense for its callers to perform atop(). Let
vm_object_shadow() do that instead.


218113 31-Jan-2011 alc

Release the free page queues lock earlier in vm_page_alloc().

Discussed with: kib@


218070 29-Jan-2011 alc

Reenable the call to vm_map_simplify_entry() from vm_map_insert() for non-
MAP_STACK_* entries. (See r71983 and r74235.)

In some cases, performing this call to vm_map_simplify_entry() halves the
number of vm map entries used by the Sun JDK.


217916 27-Jan-2011 mdf

Explicitly wire the user buffer rather than doing it implicitly in
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.

Inspired by: rwatson


217688 21-Jan-2011 pluknet

Make MSGBUF_SIZE kernel option a loader tunable kern.msgbufsize.

Submitted by: perryh pluto.rain.com (previous version)
Reviewed by: jhb
Approved by: kib (mentor)
Tested by: universe


217529 18-Jan-2011 alc

Move the definition of M_VMPGDATA to the swap pager, where the only
remaining uses are.


217508 17-Jan-2011 alc

Explicitly initialize the page's queue field to PQ_NONE instead of relying
on PQ_NONE being zero.

Redefine PQ_NONE and PQ_COUNT so that a page queue isn't allocated for
PQ_NONE.

Reviewed by: kib@


217482 16-Jan-2011 alc

Sort function prototypes.


217479 16-Jan-2011 alc

Update a lock annotation on the page structure.


217478 16-Jan-2011 alc

Shift responsibility for synchronizing access to the page's act_count
field to the object's lock.

Reviewed by: kib@


217477 16-Jan-2011 alc

Clean up the start of vm_page_alloc(). In particular, eliminate an
assertion that is no longer required. Long ago, calls to vm_page_alloc()
from an interrupt handler had to specify VM_ALLOC_INTERRUPT so that
vm_page_alloc() would not attempt to reclaim a PQ_CACHE page from another vm
object. Today, with the synchronization on a vm object's collection of
PQ_CACHE pages, this is no longer an issue. In fact, VM_ALLOC_INTERRUPT now
reclaims PQ_CACHE pages just like VM_ALLOC_{NORMAL,SYSTEM}.

MFC after: 3 weeks


217463 15-Jan-2011 kib

For consistency, use kernel_object instead of &kernel_object_store
when initializing the object mutex. Do the same for kmem_object.

Discussed with: alc
MFC after: 1 week


217453 15-Jan-2011 alc

For some time now, the kernel and kmem objects have been ordinary
OBJT_PHYS objects. Thus, there is no need for handling them specially
in vm_fault(). In fact, this special case handling would have led to
an assertion failure just before the call to pmap_enter().

Reviewed by: kib@
MFC after: 6 weeks


217265 11-Jan-2011 jhb

Remove unneeded includes of <sys/linker_set.h>. Other headers that use
it internally contain nested includes.

Reviewed by: bde


217192 09-Jan-2011 kib

Move repeated MAXSLP definition from machine/vmparam.h to sys/vmmeter.h.
Update the outdated comments describing MAXSLP and the process
selection algorithm for swap out.

Comments wording and reviewed by: alc


217177 09-Jan-2011 alc

Eliminate a redundant alignment directive on the page locks array.


217171 08-Jan-2011 alc

Eliminate the counting of vm_page_pa_tryrelock calls. We really don't
need it anymore. Moreover, its implementation had a type mismatch, a
long is not necessarily an uint64_t. (This mismatch was hidden by
casting.) Move the remaining two counters up a level in the sysctl
hierarchy. There is no reason for them to be under the vm.pmap node.

Reviewed by: kib


216899 03-Jan-2011 alc

Release the page lock early in vm_pageout_clean(). There is no reason to
hold this lock until the end of the function.

With the aforementioned change to vm_pageout_clean(), page locks don't need
to support recursive (MTX_RECURSE) or duplicate (MTX_DUPOK) acquisitions.

Reviewed by: kib


216874 01-Jan-2011 alc

Make a couple refinements to r216799 and r216810. In particular, revise
a comment and move it to its proper place.

Reviewed by: kib


216873 01-Jan-2011 brucec

There can be more than 0x20000000 swap meta blocks allocated if a swap-backed
md(4) device is used. Don't panic when deallocating such a device if swap
has been used.

PR: kern/133170
Discussed with: kib
MFC after: 3 days


216810 29-Dec-2010 kib

Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by: alc (as a part of the bigger patch)
MFC after: 1 month (after r216799 is merged)
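
The weakness of a flag compared with a counter can be seen in a tiny
model. The structure and functions are hypothetical, and the counter
appears only for contrast: the commit removes the flag rather than
converting it.

```c
#include <stdbool.h>

/* Hypothetical per-object state tracking cleaning passes two ways. */
struct clean_state {
	bool flag;	/* set while "a" cleaning pass is in progress */
	int count;	/* number of cleaning passes in progress */
};

void
clean_begin(struct clean_state *s)
{
	s->flag = true;
	s->count++;
}

void
clean_end(struct clean_state *s)
{
	s->flag = false;	/* wrong if another pass is still running */
	s->count--;
}
```

After two overlapping clean_begin() calls and one clean_end(), the flag
already reads false while the counter still reports one pass running.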


216807 29-Dec-2010 alc

There is no point in vm_contig_launder{,_page}() flushing held pages,
instead skip over them. As long as a page is held, it can't be reclaimed by
contigmalloc(M_WAITOK). Moreover, a held page may be undergoing
modification, e.g., vmapbuf(), so even if the hold were released before the
completion of contigmalloc(), the page might have to be flushed again.

MFC after: 3 weeks


216799 29-Dec-2010 kib

Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where a restart of the scan in vm_object_page_clean() did
not remove write permissions for newly added pages, or for an already
scanned page whose mapping became writeable again due to a fault.
Merge the two loops in vm_object_page_clean(), removing write
permission and cleaning in the same loop. A restart of the loop then
correctly downgrades writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by: alc
MFC after: 1 week


216772 28-Dec-2010 alc

Correct a typo in vm_fault_quick_hold_pages().

Reported by: Bartosz Stec


216731 27-Dec-2010 alc

Move vm_object_print()'s prototype to the expected place.


216701 26-Dec-2010 alc

Retire vm_fault_quick(). It's no longer used.

Reviewed by: kib@


216699 25-Dec-2010 alc

Introduce and use a new VM interface for temporarily pinning pages. This
new interface replaces the combined use of vm_fault_quick() and
pmap_extract_and_hold() throughout the kernel.

In collaboration with: kib@


216604 20-Dec-2010 alc

Introduce vm_fault_hold() and use it to (1) eliminate a long-standing race
condition in proc_rwmem() and to (2) simplify the implementation of the
cxgb driver's vm_fault_hold_user_pages(). Specifically, in proc_rwmem()
the requested read or write could fail because the targeted page could be
reclaimed between the calls to vm_fault() and vm_page_hold().

In collaboration with: kib@
MFC after: 6 weeks


216511 17-Dec-2010 alc

Implement and use a single optimized function for unholding a set of pages.

Reviewed by: kib@


216425 14-Dec-2010 alc

Change memguard_fudge() so that it can handle km_max being zero. Not
every platform defines VM_KMEM_SIZE_MAX, and on those platforms km_max
will be zero.

Reviewed by: mdf
Tested by: marius


216335 09-Dec-2010 mlaier

Fix a long standing (from the original 4.4BSD lite sources) race between
vmspace_fork and vm_map_wire that would lead to "vm_fault_copy_wired: page
missing" panics. While faulting in pages for a map entry that is being
wired down, mark the containing map as busy. In vmspace_fork wait until the
map is unbusy, before we try to copy the entries.

Reviewed by: kib
MFC after: 5 days
Sponsored by: Isilon Systems, Inc.


216319 09-Dec-2010 jchandra

Revert the vm/vm_page.c change in r216317.

This adds back changes in r216141, which was reverted by the above
check in.


216317 09-Dec-2010 jchandra

swi_vm() for mips.


216186 04-Dec-2010 trasz

Fix comment indentation.


216141 03-Dec-2010 imp

To make minidumps work properly on mips for memory that's direct
mapped and entered via vm_page_setup, keep track of it like we do
for amd64.

# A separate commit will be made to move this to a capability-based ifdef
# rather than arch-based ifdef.

Submitted by: alc@
MFC after: 1 week


216128 02-Dec-2010 trasz

Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.

Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation


216090 01-Dec-2010 alc

Correct an error in the allocation of the vm_page_dump array in
vm_page_startup(). Specifically, the dump_avail array should be used
instead of the phys_avail array to calculate the size of vm_page_dump. For
example, the pages for the message buffer are allocated prior to
vm_page_startup() by subtracting them from the last entry in the phys_avail
array, but the first thing that vm_page_startup() does after creating the
vm_page_dump array is to set the bits corresponding to the message buffer
pages in that array. However, these bits might not actually exist in the
array, because the size of the array is determined by the current value in
the last entry of the phys_avail array. In general, the only reason why
this doesn't always result in an out-of-bounds array access is that the size
of the vm_page_dump array is rounded up to the next page boundary. This
change eliminates that dependence on rounding (and luck).

MFC after: 6 weeks


215973 28-Nov-2010 jchandra

Fix issue noted by alc while reviewing r215938:
The current implementation of vm_page_alloc_freelist() does not handle
order > 0 correctly. Remove order parameter to the function and use it
only for order 0 pages.

Submitted by: alc


215796 24-Nov-2010 kib

After the sleep caused by encountering a busy page, relookup the page.

Submitted and reviewed by: alc
Reported and tested by: pho
MFC after: 5 days


215610 21-Nov-2010 kib

Eliminate the mab, maf arrays and related variables.

The change also fixes an off-by-one error in the calculation of mreq.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 5 days


215597 20-Nov-2010 alc

Optimize vm_object_terminate().

Reviewed by: kib
MFC after: 1 week


215574 20-Nov-2010 kib

The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after: 5 days


215538 19-Nov-2010 alc

Reduce the amount of detail printed by vm_page_free_toq() when it panics.

Reviewed by: kib


215508 19-Nov-2010 mlaier

Off by one page in vm_reserv_reclaim_contig(): Also reclaim reservations
with only a single free page if that satisfies the requested size.

MFC after: 3 days
Reviewed by: alc


215471 18-Nov-2010 kib

vm_pageout_flush() might cache the pages that finished write to the
backing storage. Such pages might be then reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
write-protected still. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of successfully written run from vm_pageout_flush(),
that is, the count of pages between requested page and first page
after requested with status VM_PAGER_AGAIN. Supply the requested page
index in the array to vm_pageout_flush(). Use the returned run length
to forward the index of next page to clean in vm_object_page_clean().

Reported by: avg
Reviewed by: alc
MFC after: 1 week


215469 18-Nov-2010 kib

Only increment object generation count when inserting the page into
object page list. The only use of object generation count now is a
restart of the scan in vm_object_page_clean(), which makes sense to do
on the page addition. Page removals do not affect the dirtiness of the
object, as well as manipulations with the shadow chain.

Suggested and reviewed by: alc
MFC after: 1 week


215321 14-Nov-2010 kib

Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by: swell.k gmail com
MFC after: 1 week


215309 14-Nov-2010 kib

Use symbolic names instead of hardcoding values for magic p_osrel constants.

MFC after: 1 week


215307 14-Nov-2010 kib

Implement a (soft) stack guard page for auto-growing stack mappings.
The unmapped page separates the tip of the stack and a possible adjacent
segment, making some uses of stack overflow harder. The stack growing
code refuses to expand the segment to the last page of the reserved
region when sysctl security.bsd.stack_guard_page is set to 1. The
default value for sysctl and accompanying tunable is 0.

Please note that mmap(MAP_FIXED) still can place a mapping right up to
the stack, making a contiguous region.

Reviewed by: alc
MFC after: 1 week


215093 10-Nov-2010 alc

Enable reservation-based physical memory allocation. Even without the
creation of large page mappings in the pmap, it can provide modest
performance benefits. In particular, for a "buildworld" on a 2x 1GHz
Ultrasparc IIIi it reduced the wall clock time by 2.2% and the system
time by 12.6%.

Tested by: marius@


214953 07-Nov-2010 alc

In case the stack size reaches its limit and its growth must be restricted,
ensure that grow_amount is a multiple of the page size. Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.

Diagnosed and reported by: jilles
Reviewed by: kib
MFC after: 1 week
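
Keeping the growth amount page-aligned when a limit cuts it short can be
sketched as below. The names and the 4KB page size are assumptions, and
the sketch presumes the current size and requested growth are already
page multiples; it is not the vm_map_growstack() code.

```c
#include <stddef.h>

#define SK_PAGE_SIZE	((size_t)4096)	/* assumed page size */

/* Round a byte count down to a page boundary. */
size_t
trunc_page_sk(size_t x)
{
	return (x & ~(SK_PAGE_SIZE - 1));
}

/*
 * Restrict a requested stack growth so that cur + result never exceeds
 * the limit, while keeping the result a multiple of the page size.
 */
size_t
clamp_growth(size_t cur, size_t want, size_t limit)
{
	size_t cap = trunc_page_sk(limit);

	if (cur >= cap)
		return (0);
	if (want > cap - cur)
		want = cap - cur;
	return (want);
}
```

Truncating the limit before subtracting is what keeps the result
aligned; subtracting from the raw limit could leave a partial page.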


214903 07-Nov-2010 gonzo

- Add minidump support for FreeBSD/mips


214782 04-Nov-2010 jhb

Update startup_alloc() to support multi-page allocations and allow internal
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.

Submitted by: trema
Reviewed by: attilio
MFC after: 1 week


214564 30-Oct-2010 alc

Correct some format strings used by sysctls.

MFC after: 1 week


214144 21-Oct-2010 jhb

- Make 'vm_refcnt' volatile so that compilers won't be tempted to treat
its value as a loop invariant. Currently this is a no-op because
'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
a reference in vmspace_free().

Reviewed by: alc
MFC after: 1 month
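
The two ways of dropping a reference can be compared in portable C11.
This is a sketch using <stdatomic.h> rather than the kernel's
atomic_cmpset_int() and atomic_fetchadd_int(); the function names are
made up for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Drop a reference with a compare-and-swap retry loop. */
bool
ref_drop_cas(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	/* On failure, old is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak(cnt, &old, old - 1))
		;
	return (old == 1);	/* true if this was the last reference */
}

/* Drop a reference with a single fetch-and-subtract. */
bool
ref_drop_fetchadd(atomic_int *cnt)
{
	/* atomic_fetch_sub() returns the value held before the update. */
	return (atomic_fetch_sub(cnt, 1) == 1);
}
```

Both functions report whether the caller released the last reference;
the second needs no loop because the read-modify-write is one operation.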


214095 20-Oct-2010 avg

PG_BUSY -> VPO_BUSY, PG_WANTED -> VPO_WANTED in manual pages and comments

Reviewed by: alc
MFC after: 4 days


214062 19-Oct-2010 mdf

uma_zfree(zone, NULL) should do nothing, to match free(9).

Noticed by: Ron Steinke <rsteinke at isilon dot com>
MFC after: 3 days


213911 16-Oct-2010 lstewart

Change uma_zone_set_max to return the effective value of "nitems" after
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
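
Returning the effective limit after rounding might look like the sketch
below. The structure, its fields, and the round-up-to-a-whole-slab rule
are assumptions for illustration and do not reproduce the uma(9)
implementation.

```c
/* Hypothetical zone with a fixed number of items per slab. */
struct zone_sketch {
	int items_per_slab;
	int max_items;
};

/*
 * Set the item limit, rounding up to a whole number of slabs, and
 * return the limit that is actually in effect.
 */
int
zone_set_max_sketch(struct zone_sketch *z, int nitems)
{
	int slabs = nitems / z->items_per_slab;

	if (slabs * z->items_per_slab < nitems)
		slabs++;
	z->max_items = slabs * z->items_per_slab;
	return (z->max_items);
}
```

A caller can then use the return value directly instead of following
the set call with a separate get call.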


213910 16-Oct-2010 lstewart

- Simplify implementation of uma_zone_get_max.
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.

Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks


213408 04-Oct-2010 alc

If vm_map_find() is asked to allocate a superpage-aligned region of virtual
addresses that is greater than a superpage in size but not a multiple of
the superpage size, then vm_map_find() is not always expanding the kernel
pmap to support the last few small pages being allocated. These failures
are not commonplace, so this was first noticed by someone porting FreeBSD
to a new architecture. Previously, we grew the kernel page table in
vm_map_findspace() when we found the first available virtual address.
This works most of the time because we always grow the kernel pmap or page
table by an amount that is a multiple of the superpage size. Now, instead,
we defer the call to pmap_growkernel() until we are committed to a range
of virtual addresses in vm_map_insert(). In general, there is another
reason to prefer calling pmap_growkernel() in vm_map_insert(). It makes
it possible for someone to do the equivalent of an mmap(MAP_FIXED) on the
kernel map.

Reported by: Svatopluk Kraus
Reviewed by: kib@
MFC after: 3 weeks
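
One plausible reading of the failure can be worked through with plain arithmetic. The sketch below is a model only, assuming a 2 MB superpage and that the start address is aligned upward after the first-fit search; the function names are hypothetical and none of this is the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define SUPERPAGE	UINT64_C(0x200000)	/* assumed 2 MB superpage */

/* Round x up to a multiple of align, which must be a power of two. */
uint64_t
sp_round_up(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/* Model of page table growth: coverage advances in superpage steps. */
uint64_t
sp_grow_to(uint64_t covered_end, uint64_t addr)
{
	uint64_t want = sp_round_up(addr, SUPERPAGE);

	return (want > covered_end ? want : covered_end);
}

/*
 * Old order: grow for the first-fit address, then align the start.
 * Returns the number of bytes of the final range left uncovered.
 */
uint64_t
uncovered_grow_before_align(uint64_t covered_end, uint64_t first_fit,
    uint64_t len)
{
	uint64_t start, end;

	covered_end = sp_grow_to(covered_end, first_fit + len);
	start = sp_round_up(first_fit, SUPERPAGE);
	end = start + len;
	return (end > covered_end ? end - covered_end : 0);
}

/* New order: align and commit to the range first, then grow. */
uint64_t
uncovered_grow_at_insert(uint64_t covered_end, uint64_t first_fit,
    uint64_t len)
{
	uint64_t start = sp_round_up(first_fit, SUPERPAGE);
	uint64_t end = start + len;

	covered_end = sp_grow_to(covered_end, end);
	return (end > covered_end ? end - covered_end : 0);
}
```

For a first-fit address of 0x401000 and a length of one superpage plus one small page (0x201000), the old order leaves the final 0x1000 bytes uncovered, while growing at insert time covers the whole range.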


212931 20-Sep-2010 mdf

Replace an XXX comment with the appropriate code.

Submitted by: alc


212873 19-Sep-2010 alc

Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8. This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR: 150260
MFC after: 3 weeks


212868 19-Sep-2010 alc

Make refinements to r212824. In particular, don't make
vm_map_unlock_nodefer() part of the synchronization interface for maps.

Add comments to vm_map_unlock_and_wait() and vm_map_wakeup() describing
how they should be used. In particular, describe the deferred deallocations
issue with vm_map_unlock_and_wait().

Redo the implementation of vm_map_unlock_and_wait() so that it passes
along the caller's file and line information, just like the other map
locking primitives.

Reviewed by: kib
X-MFC after: r212824


212824 18-Sep-2010 kib

Extend the deferral of object deallocation for deleted map entries, done
on map unlock, to the lock downgrade and later read unlock operations.

System map entries cannot be backed by OBJT_VNODE objects, no need to
defer deallocation for them. Map entries from user maps do not require
the owner map for deallocation, and can be accumulated in the
thread-local list for freeing when a user map is unlocked.

Move the collection of entries for deferred reclamation into
vm_map_delete(). Create helper vm_map_process_deferred(), that is
called from locations where processing is feasible. Do not process
deferred entries in vm_map_unlock_and_wait() since map_sleep_mtx is
held.

Reviewed by: alc, rstone (previous versions)
Tested by: pho
MFC after: 2 weeks


212750 16-Sep-2010 mdf

Re-add r212370 now that the LOR in powerpc64 has been resolved:

Add a drain function for struct sysctl_req, and use it for a variety
of handlers, some of which had to do awkward things to get a large
enough SBUF_FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing
NUL byte. This behaviour was preserved, though it should not be
necessary.

Reviewed by: phk (original patch)
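
The idea of a drain function can be sketched without any kernel interfaces: a small staging buffer hands its contents to a callback whenever it fills, so the producer never needs a buffer sized for the whole output. All names below (struct drainbuf, drainbuf_append(), sink_drain()) are hypothetical and are not the sbuf(9) or sysctl API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Called with the staged bytes whenever the buffer fills or is flushed. */
typedef int drain_fn(void *arg, const char *data, size_t len);

struct drainbuf {
	char buf[16];		/* deliberately small staging buffer */
	size_t len;		/* bytes currently staged */
	drain_fn *drain;	/* consumer of staged bytes */
	void *arg;		/* opaque argument passed to drain */
};

/* Hand all staged bytes to the drain function and empty the buffer. */
int
drainbuf_flush(struct drainbuf *db)
{
	int error = 0;

	if (db->len > 0)
		error = db->drain(db->arg, db->buf, db->len);
	db->len = 0;
	return (error);
}

/* Append len bytes, draining each time the staging buffer becomes full. */
int
drainbuf_append(struct drainbuf *db, const char *data, size_t len)
{
	while (len > 0) {
		size_t room = sizeof(db->buf) - db->len;
		size_t n = len < room ? len : room;
		int error;

		memcpy(db->buf + db->len, data, n);
		db->len += n;
		data += n;
		len -= n;
		if (db->len == sizeof(db->buf) &&
		    (error = drainbuf_flush(db)) != 0)
			return (error);
	}
	return (0);
}

/* Example consumer: collect drained bytes and count the drain calls. */
struct sink {
	char out[128];
	size_t len;
	int calls;
};

int
sink_drain(void *arg, const char *data, size_t len)
{
	struct sink *s = arg;

	if (s->len + len > sizeof(s->out))
		return (-1);
	memcpy(s->out + s->len, data, len);
	s->len += len;
	s->calls++;
	return (0);
}
```

A producer calls drainbuf_append() as often as it likes and drainbuf_flush() once at the end; the 16-byte buffer is enough regardless of the total output size.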


212572 13-Sep-2010 mdf

Revert r212370, as it causes a LOR on powerpc. powerpc does a few
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.

Requested by: nwhitehorn


212370 09-Sep-2010 mdf

Add a drain function for struct sysctl_req, and use it for a variety of
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.

Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.

Reviewed by: phk


212360 09-Sep-2010 nwhitehorn

On architectures with non-tree-based page tables like PowerPC, every page
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.

MFC after: 6 weeks
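
The cost difference is easy to see by counting pages. The sketch below assumes a sorted array of map entries measured in page numbers; struct map_entry and both functions are illustrative stand-ins, not the vm_map structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct map_entry {
	uint64_t start;	/* first page number of the entry */
	uint64_t end;	/* one past the last page number */
};

/* Pages examined if one remove call spans the first start to the last end. */
uint64_t
pages_checked_whole_range(const struct map_entry *e, size_t n)
{
	return (n == 0 ? 0 : e[n - 1].end - e[0].start);
}

/* Pages examined if each entry is removed individually. */
uint64_t
pages_checked_per_entry(const struct map_entry *e, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += e[i].end - e[i].start;
	return (total);
}
```

With two 16-page entries separated by a large hole, the whole-range call walks the hole as well, while the per-entry loop touches only 32 pages.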


212282 07-Sep-2010 rstone

Fix a typo in r212281. uintptr -> uintptr_t

Pointy hat to: rstone

Approved by: emaste (mentor)
MFC after: 2 weeks


212281 07-Sep-2010 rstone

In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock. This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by: fabient
Approved by: emaste (mentor)
MFC after: 2 weeks


212174 03-Sep-2010 avg

vm_page.c: include opt_msgbuf.h for MSGBUF_SIZE use in vm_page_startup

vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overridden in kernel
config, then not all msgbuf pages will be dumped. And most importantly,
struct msgbuf itself will not be included. Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.

MFC after: 5 days


212063 31-Aug-2010 mdf

Have memguard(9) crash with an easier-to-debug message on double-free.

Reviewed by: zml
MFC after: 3 weeks


212058 31-Aug-2010 mdf

The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation. Fix this issue.

Noticed by: alc
Reviewed by: alc
Approved by: zml (mentor)
MFC after: 3 weeks
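
The class of bug is generic to any allocate-copy-free realloc: the copy length must be the smaller of the old and new sizes. A minimal userland sketch follows; copying_realloc() is a hypothetical helper and not the memguard(9) code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Reallocate by allocating a fresh block, copying, and freeing the old
 * one.  old_size is the usable size of the old block.
 */
void *
copying_realloc(void *old, size_t old_size, size_t new_size)
{
	void *fresh = malloc(new_size);

	if (fresh == NULL)
		return (NULL);
	if (old != NULL) {
		/* Copy only what fits in both blocks. */
		memcpy(fresh, old, old_size < new_size ? old_size : new_size);
		free(old);
	}
	return (fresh);
}
```

Copying old_size bytes unconditionally would overrun the new block whenever the allocation shrinks.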


211937 28-Aug-2010 alc

Add the MAP_PREFAULT_READ option to mmap(2).

Reviewed by: jhb, kib


211396 16-Aug-2010 andre

Add uma_zone_get_max() to obtain the effective limit after a call
to uma_zone_set_max().

The UMA zone limit is not exactly set to the value supplied but
rounded up to completely fill the backing store increment (a page
normally). This can lead to surprising situations where the number
of elements allocated from UMA is higher than the supplied limit
value. The new get function reads back the effective value so that
the supplied limit value can be adjusted to the real limit.

Reviewed by: jeffr
MFC after: 1 week
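
The rounding described above amounts to a ceiling division. The sketch below assumes the backing store increment holds a fixed number of items; effective_limit() and its items_per_page parameter are hypothetical, and UMA's own computation may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Round a requested item limit up to whole backing-store increments. */
size_t
effective_limit(size_t nitems, size_t items_per_page)
{
	size_t pages;

	assert(items_per_page > 0);
	pages = (nitems + items_per_page - 1) / items_per_page;
	return (pages * items_per_page);
}
```

With 32 items per page, a requested limit of 100 becomes an effective limit of 128, which is the kind of value a caller would read back.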


211229 12-Aug-2010 mdf

Fix compile. It seemed better to have memguard.c include opt_vm.h in
case future compile-time knobs were added that it wants to use.
Also add include guards and forward declarations to vm/memguard.h.

Approved by: zml (mentor)
MFC after: 1 month


211194 11-Aug-2010 mdf

Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time. Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9). Allow setting and varying the malloc type at run-time.
Add knobs to allow:

- randomly guarding memory
- adding un-backed KVA guard pages to detect underflow and overflow
- a lower limit on the size of allocations that are guarded

Reviewed by: alc
Reviewed by: brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from: -arch
Approved by: zml (mentor)
MFC after: 1 month


210923 06-Aug-2010 kib

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that the created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


210550 27-Jul-2010 jhb

Very rough first cut at NUMA support for the physical page allocator. For
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfilling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.

Reviewed by: alc
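
Two of the data structures above can be sketched in a few lines: an affinity array terminated by an all-zero entry, and per-domain lookup lists in round-robin order. The sketch assumes four domains, exclusive range ends, and that "round-robin" means each domain searches itself first and then the following domains with wraparound; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NDOMAIN 4	/* stand-in for VM_NDOMAIN; the value is assumed */

struct mem_affinity_sketch {
	uint64_t start;	/* first physical address of the range */
	uint64_t end;	/* one past the last address (assumed exclusive) */
	int domain;	/* memory domain owning the range */
};

/* Find the domain of physical address pa, or -1 if no entry covers it. */
int
affinity_domain(const struct mem_affinity_sketch *ma, uint64_t pa)
{
	/* The array ends at the first entry whose fields are all zero. */
	for (; ma->start != 0 || ma->end != 0 || ma->domain != 0; ma++)
		if (pa >= ma->start && pa < ma->end)
			return (ma->domain);
	return (-1);
}

/* Fill lists[d][] with the search order for domain d: d, d+1, ... */
void
build_lookup_lists(int lists[NDOMAIN][NDOMAIN])
{
	for (int d = 0; d < NDOMAIN; d++)
		for (int i = 0; i < NDOMAIN; i++)
			lists[d][i] = (d + i) % NDOMAIN;
}
```

A first-touch allocation on a CPU in domain 2 would then try domains 2, 3, 0, 1 in that order; a distance table could later replace the fixed rotation.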


210548 27-Jul-2010 trasz

Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.


210545 27-Jul-2010 alc

Introduce exec_alloc_args(). The objective is to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name. The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by: kib


210475 25-Jul-2010 alc

Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer. For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%. It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings. Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings. While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map. In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated. The old size was one page too
large. (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by: kib


210327 21-Jul-2010 jchandra

Redo the page table page allocation on MIPS, as suggested by
alc@.

The UMA zone based allocation is replaced by a scheme that creates
a new free page list for the KSEG0 region, and a new function
in sys/vm that allocates pages from a specific free page list.

This also fixes a race condition introduced by the UMA based page table
page allocation code. Dropping the page queue and pmap locks before
the call to uma_zfree, and re-acquiring them afterwards will introduce
a race condition (noted by alc@).

The changes are :
- Revert the earlier changes in MIPS pmap.c that added UMA zone for
page table pages.
- Add a new freelist VM_FREELIST_HIGHMEM to MIPS vmparam.h for memory that
is not directly mapped (in 32-bit kernel). Normal page allocations will first
try the HIGHMEM freelist and then the default (direct mapped) freelist.
- Add a new function 'vm_page_t vm_page_alloc_freelist(int flind, int
order, int req)' to vm/vm_page.c to allocate a page from a specified
freelist. The MIPS page table pages will be allocated using this function
from the freelist containing direct mapped pages.
- Move the page initialization code from vm_phys_alloc_contig() to a
new function vm_page_alloc_init(), and use this function to initialize
pages in vm_page_alloc_freelist() too.
- Split the function vm_phys_alloc_pages(int pool, int order) to create
vm_phys_alloc_freelist_pages(int flind, int pool, int order), and use
this function from both vm_page_alloc_freelist() and vm_phys_alloc_pages().

Reviewed by: alc


209861 09-Jul-2010 alc

Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit. Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by: kib
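
A count hint of this kind can be carried in the upper bits of the request word. The sketch below assumes a 16-bit shift and assumes the allocator charges the larger of the hinted count and one, so that a caller passing its full page count is not charged an extra page; REQ_COUNT() and both functions are hypothetical stand-ins.

```c
#include <assert.h>

#define REQ_COUNT_SHIFT	16	/* assumed: low bits hold class and flags */
#define REQ_COUNT(n)	((n) << REQ_COUNT_SHIFT)

/* Extract the count hint packed into the upper bits of a request word. */
int
req_count(int req)
{
	return ((int)((unsigned int)req >> REQ_COUNT_SHIFT));
}

/* Deficit to record when the allocation fails: the hint, or one page. */
int
deficit_increment(int req)
{
	int count = req_count(req);

	return (count > 1 ? count : 1);
}
```

A request without a hint adds one page to the deficit, and a request carrying REQ_COUNT(8) adds eight, not nine.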


209792 08-Jul-2010 kib

Make VM_ALLOC_RETRY flag mandatory for vm_page_grab(). Assert that the
flag is always provided, and unconditionally retry after sleep for the
busy page or failed allocation.

The intent is to remove VM_ALLOC_RETRY eventually.

Proposed and reviewed by: alc


209713 05-Jul-2010 kib

Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by: alc
MFC after: 2 weeks


209702 04-Jul-2010 kib

Several cleanups for r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by: alc


209686 04-Jul-2010 kib

Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by: alc
Tested by: pho
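
The approach of resuming by index can be sketched on a sorted array. This is an illustration under assumptions: the array stands in for the ordered memq, find_least() stands in for vm_page_find_least() (introduced in r209685), and the batch boundary marks where the object lock would be dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the position of the first page index >= pindex in a sorted
 * array, or n if there is none.
 */
size_t
find_least(const uint64_t *pages, size_t n, uint64_t pindex)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pages[mid] < pindex)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/*
 * Visit every page with index in [start, end), 'batch' pages at a time.
 * Only the next index to visit survives between batches, so no per-page
 * progress flag is needed.  Returns the number of pages visited.
 */
size_t
clean_range(const uint64_t *pages, size_t n, uint64_t start, uint64_t end,
    size_t batch)
{
	uint64_t next = start;
	size_t visited = 0;

	for (;;) {
		size_t i = find_least(pages, n, next);
		size_t done = 0;

		while (i < n && pages[i] < end && done < batch) {
			next = pages[i] + 1;
			visited++;
			done++;
			i++;
		}
		if (done == 0)
			break;
		/* The lock would be dropped here; the list may change. */
	}
	return (visited);
}
```

Because each batch starts with a fresh lookup from the remembered index, pages inserted or removed while the lock is dropped do not confuse the scan.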


209685 04-Jul-2010 kib

Introduce a helper function vm_page_find_least(). Use it in several places
that previously open-coded the same lookup.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


209669 03-Jul-2010 alc

Improve the comment and man page for vm_page_alloc(). Specifically,
document one of the optional flags; clarify which of the flags are
optional (and which are not), and remove mention of a restriction on
the reclamation of cached pages that no longer holds since version 7.

MFC after: 1 week


209651 02-Jul-2010 alc

Push down the acquisition of the page queues lock into
vm_pageout_page_stats(). In particular, avoid acquiring the page
queues lock unless iterating over the active queue.


209650 02-Jul-2010 alc

Use vm_page_prev() instead of vm_page_lookup() in the implementation of
vm_fault()'s automatic delete-behind heuristic.
vm_page_prev() is typically faster.


209647 02-Jul-2010 alc

With the demise of page coloring, the page queue macros no longer serve any
useful purpose. Eliminate them.

Reviewed by: kib


209610 30-Jun-2010 alc

Simplify entry to vm_pageout_clean(). Expect the page to be locked.
Previously, the caller unlocked the page, and vm_pageout_clean()
immediately reacquired the page lock. Also, assert rather than test
that the page is neither busy nor held. Since vm_pageout_clean() is
called with the object and page locked, the page can't have changed
state since the caller verified that the page is neither busy nor
held.


209407 21-Jun-2010 alc

Introduce vm_page_next() and vm_page_prev(), and use them in
vm_pageout_clean(). When iterating over a range of pages, these functions
can be cheaper than vm_page_lookup() because their implementation takes
advantage of the vm_object's memq being ordered.

Reviewed by: kib@
MFC after: 3 weeks
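
When resident pages are linked in ascending index order, the neighbor of a page is found by following one pointer and comparing indices. A minimal sketch, assuming a doubly linked list; struct page, page_next(), and page_prev() are illustrative and not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct page {
	uint64_t pindex;	/* offset of the page within its object */
	struct page *next;	/* next resident page, ascending pindex */
	struct page *prev;	/* previous resident page */
};

/* Return the page at pindex + 1 if it is resident, else NULL. */
struct page *
page_next(struct page *m)
{
	struct page *n = m->next;

	if (n != NULL && n->pindex != m->pindex + 1)
		n = NULL;
	return (n);
}

/* Return the page at pindex - 1 if it is resident, else NULL. */
struct page *
page_prev(struct page *m)
{
	struct page *p = m->prev;

	if (p != NULL && p->pindex != m->pindex - 1)
		p = NULL;
	return (p);
}
```

Each call costs one pointer dereference and one comparison, with no search of the object's lookup structure.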


209215 15-Jun-2010 sbruno

Add a new column to the output of vmstat -z to indicate the number
of times the system was forced to sleep when requesting a new allocation.

Expand the debugger hook, db_show_uma, to display these results as well.

This has proven to be very useful in out of memory situations when
it is not known why systems have become sluggish or fail in odd ways.

Reviewed by: rwatson alc
Approved by: scottl (mentor) peter
Obtained from: Yahoo Inc.


209173 14-Jun-2010 alc

Eliminate checks for a page having a NULL object in vm_pageout_scan()
and vm_pageout_page_stats(). These checks were recently introduced by
the first page locking commit, r207410, but they are not needed. At
the same time, eliminate some redundant accesses to the page's object
field. (These accesses should have been eliminated by r207410.)

Make the assertion in vm_page_flag_set() stricter. Specifically, only
managed pages should have PG_WRITEABLE set.

Add a comment documenting an assertion to vm_page_flag_clear().

It has long been the case that fictitious pages have their wire count
permanently set to one. Add comments to vm_page_wire() and
vm_page_unwire() documenting this. Add assertions to these functions
as well.

Update the comment describing vm_page_unwire(). Much of the old
comment had little to do with vm_page_unwire(), but a lot to do with
_vm_page_deactivate(). Move relevant parts of the old comment to
_vm_page_deactivate().

Only pages that belong to an object can be paged out. Therefore, it
is pointless for vm_page_unwire() to acquire the page queues lock and
enqueue such pages in one of the paging queues. Generally speaking,
such pages are immediately freed after the call to vm_page_unwire().
Previously, it was the call to vm_page_free() that reacquired the page
queues lock and removed these pages from the paging queues. Now, we
will never acquire the page queues lock for this case. (It is also
worth noting that since both vm_page_unwire() and vm_page_free()
occurred with the page locked, the page daemon never saw the page with
its object field set to NULL.)

Change the panic in vm_page_unwire() to provide a more precise message.

Reviewed by: kib@


209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().


208990 10-Jun-2010 alc

Reduce the scope of the page queues lock and the number of
PG_REFERENCED changes in vm_pageout_object_deactivate_pages().
Simplify this function's inner loop using TAILQ_FOREACH(), and shorten
some of its overly long lines. Update a stale comment.

Assert that PG_REFERENCED may be cleared only if the object containing
the page is locked. Add a comment documenting this.

Assert that a caller to vm_page_requeue() holds the page queues lock,
and assert that the page is on a page queue.

Push down the page queues lock into pmap_ts_referenced() and
pmap_page_exists_quick(). (As of now, there are no longer any pmap
functions that expect to be called with the page queues lock held.)

Neither pmap_ts_referenced() nor pmap_page_exists_quick() should ever
be passed an unmanaged page. Assert this rather than returning "0"
and "FALSE" respectively.

ARM:

Simplify pmap_page_exists_quick() by switching to TAILQ_FOREACH().

Push down the page queues lock inside of pmap_clearbit(), simplifying
pmap_clear_modify(), pmap_clear_reference(), and pmap_remove_write().
Additionally, this allows for avoiding the acquisition of the page
queues lock in some cases.

PowerPC/AIM:

moea*_page_exists_quick() and moea*_page_wired_mappings() will never be
called before pmap initialization is complete. Therefore, the check
for moea_initialized can be eliminated.

Push down the page queues lock inside of moea*_clear_bit(),
simplifying moea*_clear_modify() and moea*_clear_reference().

The last parameter to moea*_clear_bit() is never used. Eliminate it.

PowerPC/BookE:

Simplify mmu_booke_page_exists_quick()'s control flow.

Reviewed by: kib@


208794 04-Jun-2010 jchandra

Make vm_contig_grow_cache() extern, and use it when vm_phys_alloc_contig()
fails to allocate MIPS page table pages. The current usage of VM_WAIT in
case of vm_phys_alloc_contig() failure is not correct, because:

"There is no guarantee that any of the available free (or cached) pages
after the VM_WAIT will fall within the range of suitable physical
addresses. Every time this function sleeps and a single page is freed
(or cached) by someone else, this function will be reawakened. With
a little bad luck, you could spin indefinitely."

We also add low and high parameters to vm_contig_grow_cache() and
vm_contig_launder() so that we restrict vm_contig_launder() to the range
of pages we are interested in.

Reported by: alc

Reviewed by: alc
Approved by: rrs (mentor)


208791 03-Jun-2010 kib

Do not leak vm page lock in vm_contig_launder(), vm_pageout_page_lock()
always returns with the page locked.

Submitted by: alc
Pointy hat to: kib


208772 03-Jun-2010 kib

Add assertion and comment in vm_page_flag_set() describing the expectations
when the PG_WRITEABLE flag is set.

Reviewed by: alc


208764 03-Jun-2010 alc

Maintain the pretense that we support 32KB pages for the sake of the ia64
LINT build.


208745 02-Jun-2010 alc

Minimize the use of the page queues lock for synchronizing access to the
page's dirty field. With the exception of one case, access to this field
is now synchronized by the object lock.


208645 29-May-2010 alc

When I pushed down the page queues lock into pmap_is_modified(), I created
an ordering dependence: A pmap operation that clears PG_WRITEABLE and calls
vm_page_dirty() must perform the call first. Otherwise, pmap_is_modified()
could return FALSE without acquiring the page queues lock because the page
is not (currently) writeable, and the caller to pmap_is_modified() might
believe that the page's dirty field is clear because it has not seen the
effect of the vm_page_dirty() call.

When I pushed down the page queues lock into pmap_is_modified(), I
overlooked one place where this ordering dependence is violated:
pmap_enter(). In a rare situation pmap_enter() can be called to replace a
dirty mapping to one page with a mapping to another page. (I say rare
because replacements generally occur as a result of a copy-on-write fault,
and so the old page is not dirty.) This change delays clearing PG_WRITEABLE
until after vm_page_dirty() has been called.

Fixing the ordering dependency also makes it easy to introduce a small
optimization: When pmap_enter() used to replace a mapping to one page with a
mapping to another page, it freed the pv entry for the first mapping and
later called the pv entry allocator for the new mapping. Now, pmap_enter()
attempts to recycle the old pv entry, saving two calls to the pv entry
allocator.

There is no point in setting PG_WRITEABLE on unmanaged pages, so don't.
Update a comment to reflect this.

Tidy up the variable declarations at the start of pmap_enter().


208574 26-May-2010 alc

Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced(). Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively. In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting. This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages(). This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern style of function
definition. Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by: kib


208524 25-May-2010 alc

Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages(). It is no longer needed.

Submitted by: kib


208504 24-May-2010 alc

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


208340 20-May-2010 kib

When waiting for the busy page, do not unlock the object unless unlock
cannot be avoided.

Reviewed by: alc
MFC after: 1 week


208264 18-May-2010 alc

The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty(). Perform some style clean up while I'm here.

Reviewed by: kib


208175 16-May-2010 alc

On entry to pmap_enter(), assert that the page is busy. While I'm
here, make the style of assertion used by pmap_enter() consistent
across all architectures.

On entry to pmap_remove_write(), assert that the page is neither
unmanaged nor fictitious, since we cannot remove write access to
either kind of page.

With the push down of the page queues lock, pmap_remove_write() cannot
condition its behavior on the state of the PG_WRITEABLE flag if the
page is busy. Assert that the object containing the page is locked.
This allows us to know that the page will neither become busy nor will
PG_WRITEABLE be set on it while pmap_remove_write() is running.

Correct a long-standing bug in vm_page_cowsetup(). We cannot possibly
do copy-on-write-based zero-copy transmit on unmanaged or fictitious
pages, so don't even try. Previously, the call to pmap_remove_write()
would have failed silently.


208164 16-May-2010 alc

Correct an error of omission in r202897: Now that amd64 uses the direct map
to access the message buffer, we must explicitly request that the underlying
physical pages are included in a crash dump.

Reported by: Benjamin Kaduk


208159 16-May-2010 alc

Add a comment about the proper use of vm_object_page_remove().

MFC after: 1 week


207905 11-May-2010 alc

Update synchronization annotations for struct vm_page. Add a comment
explaining how the setting of PG_WRITEABLE is synchronized.


207846 10-May-2010 kib

Continue cleaning the queue instead of moving to the next queue or
bailing out if acquisition of page lock caused page position in the
queue to change.

Pointed out by: alc


207823 09-May-2010 alc

Push down the acquisition of the page queues lock into vm_pageq_remove().
(This eliminates a surprising number of page queues lock acquisitions by
vm_fault() because the page's queue is PQ_NONE and thus the page queues
lock is not needed to remove the page from a queue.)


207822 09-May-2010 alc

Call vm_page_deactivate() rather than vm_page_dontneed() in
swp_pager_force_pagein(). By dirtying the page, swp_pager_force_pagein()
forces vm_page_dontneed() to insert the page at the head of the inactive
queue, just like vm_page_deactivate() does. Moreover, because the page
was invalid, it can't have been mapped, and thus the other effect of
vm_page_dontneed(), clearing the page's reference bits, has no effect. In
summary, there is no reason to call vm_page_dontneed() since its effect
will be identical to calling the simpler vm_page_deactivate().


207806 09-May-2010 alc

Remove the page queues lock around a call to vm_page_activate(). Make the
page dirty before adding it to the active queue.


207798 08-May-2010 alc

Minimize the scope of the page queues lock in vm_fault().


207796 08-May-2010 alc

Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mappings(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


207759 07-May-2010 jkim

Fix a typo in the previous commit.


207752 07-May-2010 kib

One more use for vm_pageout_init_marker().

Reviewed by: alc


207747 07-May-2010 alc

Eliminate unnecessary page queues locking.


207746 07-May-2010 alc

Push down the page queues lock into vm_page_activate().


207740 07-May-2010 alc

Update the synchronization requirements for the page usage count.


207739 07-May-2010 alc

Eliminate acquisitions of the page queues lock that are no longer needed.

Switch to a per-processor counter for the number of pages freed during
process termination.


207738 07-May-2010 alc

Push down the page queues lock into vm_page_deactivate(). Eliminate an
incorrect comment.


207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


207706 06-May-2010 alc

Update a comment to say that access to a page's wire count is now
synchronized by the page lock.


207702 06-May-2010 alc

Push down the page queues lock inside of vm_page_free_toq() and
pmap_page_is_mapped() in preparation for removing page queues locking
around calls to vm_page_free(). Setting aside the assertion that calls
pmap_page_is_mapped(), vm_page_free_toq() now acquires and holds the page
queues lock just long enough to actually add or remove the page from the
paging queues.

Update vm_page_unhold() to reflect the above change.


207694 06-May-2010 kib

Add a helper function vm_pageout_page_lock(), similar to tegge's
vm_pageout_fallback_object_lock(), to obtain the page lock
while having page queue lock locked, and still maintain the
page position in a queue.

Use the helper to lock the page in the pageout daemon and contig launder
iterators instead of skipping the page if its lock is contested.
Skipping locked pages can easily prevent the pagedaemon or launder from
making progress with page cleaning.

Proposed and reviewed by: alc


207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


207644 05-May-2010 alc

Push down the acquisition of the page queues lock into vm_page_unwire().

Update the comment describing which lock should be held on entry to
vm_page_wire().

Reviewed by: kib


207617 04-May-2010 alc

Add page locking to the vm_page_cow* functions.

Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by: kib


207601 04-May-2010 alc

Add lock assertions.


207580 03-May-2010 kib

Handle busy status of the page in a way expected for pager_getpage().
Flush requested page, unbusy other pages, do not clear m->busy.

Reviewed by: alc
MFC after: 1 week


207577 03-May-2010 alc

Acquire the page lock around vm_page_wire() in vm_page_grab().

Assert that the page lock is held in vm_page_wire().


207576 03-May-2010 alc

It makes more sense for the object-based backend allocator to use OBJT_PHYS
objects instead of OBJT_DEFAULT objects because we never reclaim or pageout
the allocated pages. Moreover, they are mapped with pmap_qenter(), which
creates unmanaged mappings.

Reviewed by: kib


207552 03-May-2010 alc

The pages allocated by kmem_alloc_attr() and kmem_malloc() are unmanaged.
Consequently, neither the page lock nor the page queues lock is needed to
unwire and free them.


207551 03-May-2010 alc

Assert that the page queues lock is held in vm_page_remove() and
vm_page_unwire() only if the page is managed, i.e., pageable.


207544 02-May-2010 alc

Add page lock assertions where we access the page's hold_count.


207541 02-May-2010 alc

Eliminate an assignment that was made redundant by r207410.


207540 02-May-2010 alc

Defer the acquisition of the page and page queues locks in
vm_pageout_object_deactivate_pages().


207539 02-May-2010 alc

Simplify vm_fault(). The introduction of the new page lock renders a bit of
cleverness by vm_fault() to avoid repeatedly releasing and reacquiring the
page queues lock pointless.

Reviewed by: kib, kmacy


207531 02-May-2010 alc

Correct an error in r207410: Remove an unlock of a lock that is no longer
held.


207530 02-May-2010 alc

It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(),
to unconditionally set PG_REFERENCED on a page before sleeping. In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable. Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.


207519 02-May-2010 alc

This change addresses the race condition that was introduced by the previous
revision, r207450, to this file. Specifically, between dropping the page
queues lock in vm_contig_launder() and reacquiring it in
vm_contig_launder_page(), the page may be removed from the active or
inactive queue. It could be wired, freed, cached, etc. None of which
vm_contig_launder_page() is prepared for.

Reviewed by: kib, kmacy


207487 02-May-2010 alc

Correct an error of omission in r206819. If VMFS_TLB_ALIGNED_SPACE is
specified to vm_map_find(), then retry the vm_map_findspace() if
vm_map_insert() fails because the aligned space is already partly used.

Reported by: Neel Natu


207460 01-May-2010 kmacy

Update locking comment above vm_page:
- re-assign page queue lock "Q"
- assign page lock "P"
- update several uncommented fields
- observe that hold_count is now protected by the page lock "P"


207452 30-Apr-2010 kmacy

push up dropping of the page queue lock to avoid holding it in vm_pageout_flush


207451 30-Apr-2010 kmacy

don't call vm_pageout_flush with the page queue mutex held

Reported by: Michael Butler


207450 30-Apr-2010 kmacy

- acquire the page lock in vm_contig_launder_page before checking page fields
- release page queue lock before calling vm_pageout_flush


207448 30-Apr-2010 kmacy

- don't check hold_count without the page lock held
- don't leak the page lock if m->object is NULL
(assuming that that check will in fact even be valid when m->object is protected by the page lock)


207438 30-Apr-2010 kib

Unlock page lock instead of recursively locking it.


207412 30-Apr-2010 kmacy

don't allow unsynchronized free in vm_page_unhold


207410 30-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib
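
A hashed array of page locks, as described above, can be sketched in
user-space C. This is an illustration only: every sketch_* name, the lock
count of 256, the 2 MB stripe size, and the use of C11 atomic_flag spin
locks are assumptions, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameters: 256 locks, one stripe per 2 MB of physical memory. */
#define SKETCH_LOCK_COUNT   256
#define SKETCH_STRIPE_SHIFT 21

/* A page with a physical address and the counter that the page lock protects. */
struct sketch_page {
    uint64_t phys_addr;
    int      hold_count;
};

/* One spin lock per hash bucket; many pages share each lock. */
static atomic_flag sketch_page_locks[SKETCH_LOCK_COUNT];

/* Put every lock into the released state before first use. */
static void sketch_page_locks_init(void)
{
    for (size_t i = 0; i < SKETCH_LOCK_COUNT; i++)
        atomic_flag_clear(&sketch_page_locks[i]);
}

/* Hash a physical address to the index of the lock that covers it. */
static size_t sketch_lock_index(uint64_t pa)
{
    return (size_t)((pa >> SKETCH_STRIPE_SHIFT) % SKETCH_LOCK_COUNT);
}

/* Bump hold_count under the page's hashed lock rather than a global lock. */
static void sketch_page_hold(struct sketch_page *m)
{
    atomic_flag *lock = &sketch_page_locks[sketch_lock_index(m->phys_addr)];

    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;   /* spin until the lock is free */
    m->hold_count++;
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Two pages in different stripes usually hash to different locks, so holding
one does not serialize against holding the other, which is the point of
moving hold_count out from under a single queue mutex.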


207374 29-Apr-2010 alc

Simplify the inner loop of vm_pageout_object_deactivate_pages(). Rather
than checking each page for PG_UNMANAGED, check the vm object's type.
Only OBJT_PHYS can have unmanaged pages. Eliminate a pointless counter.
The vm object is locked, that lock is never released by the inner loop,
and the set of pages contained by the vm object is not changed by the
inner loop. Therefore, the counter serves no purpose.


207365 29-Apr-2010 kib

When doing kstack swapin, read as many pages in one run as possible.

Suggested and reviewed by: alc (previous version)
Tested by: pho
MFC after: 2 weeks


207364 29-Apr-2010 kib

In swap pager, do not free the non-requested pages from the run if they are
wired. Kstack pages are wired, this change prepares swap pager for handling
of long runs of kstack pages.

Noted and reviewed by: alc
Tested by: pho
MFC after: 2 weeks


207308 28-Apr-2010 alc

Setting PG_REFERENCED on a page at the end of vm_fault() is redundant since
the page table entry's accessed bit is either preset by the immediately
preceding call to pmap_enter() or by hardware (or software) upon return
from vm_fault() when the faulting access is restarted.


207306 28-Apr-2010 alc

Change vm_object_madvise() so that it checks whether the page is invalid
or unmanaged before acquiring the page queues lock. Neither of these
tests requires that lock. Moreover, a better way of testing if the page
is unmanaged is to test the type of vm object. This avoids a pointless
vm_page_lookup().

MFC after: 3 weeks


207155 24-Apr-2010 alc

Resurrect pmap_is_referenced() and use it in mincore(). Essentially,
pmap_ts_referenced() is not always appropriate for checking whether or
not pages have been referenced because it clears any reference bits
that it encounters. For example, in mincore(), clearing the reference
bits has two negative consequences. First, it throws off the activity
count calculations performed by the page daemon. Specifically, a page
on which mincore() has called pmap_ts_referenced() looks less active
to the page daemon than it should. Consequently, the page could be
deactivated prematurely by the page daemon. Arguably, this problem
could be fixed by having mincore() duplicate the activity count
calculation on the page. However, there is a second problem for which
that is not a solution. In order to clear a reference on a 4KB page,
it may be necessary to demote a 2/4MB page mapping. Thus, a mincore()
by one process can have the side effect of demoting a superpage
mapping within another process!
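
The difference between a destructive and a non-destructive reference check
can be modelled in a few lines of C. The sketch_* names and the one-bit
"page table entry" are assumptions made for illustration; the real pmap
functions operate on hardware page tables and return richer results.

```c
#include <stdbool.h>

/* Toy page table entry holding only an "accessed" bit. */
struct sketch_pte {
    bool accessed;
};

/* Test-and-clear, in the spirit of pmap_ts_referenced(): reports the
 * reference and erases it, so the next observer sees nothing. */
static int sketch_ts_referenced(struct sketch_pte *pte)
{
    int was_referenced = pte->accessed ? 1 : 0;

    pte->accessed = false;
    return was_referenced;
}

/* Test only, in the spirit of pmap_is_referenced(): the bit is left
 * intact for whoever looks next (here, the page daemon). */
static bool sketch_is_referenced(const struct sketch_pte *pte)
{
    return pte->accessed;
}
```

A query such as mincore() wants the second form: asking the question twice
gives the same answer, and the page daemon's later scan still sees the
reference.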


206885 20-Apr-2010 alc

Eliminate an unnecessary call to pmap_remove_all(). If a page belongs to
an object whose reference count is zero, then that page cannot possibly
be mapped.


206823 19-Apr-2010 alc

vm_thread_swapout() can safely dirty the page before rather than after
acquiring the page queues lock.


206819 18-Apr-2010 jmallett

o) Add a VM find-space option, VMFS_TLB_ALIGNED_SPACE, which searches the
address space for an address as aligned by the new pmap_align_tlb()
function, which is for constraints imposed by the TLB. [1]
o) Add a kmem_alloc_nofault_space() function, which acts like
kmem_alloc_nofault() but allows the caller to specify which find-space
option to use. [1]
o) Use kmem_alloc_nofault_space() with VMFS_TLB_ALIGNED_SPACE to allocate the
kernel stack address on MIPS. [1]
o) Make pmap_align_tlb() on MIPS align addresses so that they do not start on
an odd boundary within the TLB, so that they are suitable for insertion as
wired entries and do not have to share a TLB entry with another mapping,
assuming they are appropriately-sized.
o) Eliminate md_realstack now that the kstack will be appropriately-aligned on
MIPS.
o) Increase the number of guard pages to 2 so that we retain the proper
alignment of the kstack address.

Reviewed by: [1] alc
X-MFC-after: Making sure alc has not come up with a better interface.


206814 18-Apr-2010 alc

Remove a nonsensical test from vm_pageout_clean(). A page can't be in the
inactive queue and have a non-zero wire count.

Reviewed by: kib
MFC after: 3 weeks


206801 18-Apr-2010 alc

There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on. Eliminate it.

MFC after: 3 weeks


206770 17-Apr-2010 alc

In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED. In that case,
the intention is to activate the page, so discouraging the page daemon
from reclaiming the page makes sense. In contrast, in the other cases,
MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage
the page daemon from reclaiming the page by setting PG_REFERENCED.

Wrap a nearby line.

Discussed with: kib
MFC after: 3 weeks


206768 17-Apr-2010 alc

In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical. Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened. However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening. More importantly, this muddles the meaning of
PG_REFERENCED. There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page. There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.

Discussed with: kib
MFC after: 3 weeks


206761 17-Apr-2010 alc

Setting PG_REFERENCED on the requested page in swap_pager_getpages() is
either redundant or harmful, depending on the caller. For example, when
called by vm_fault(), it is redundant. However, when called by
vm_thread_swapin(), it is harmful. Specifically, if the thread is later
swapped out, having PG_REFERENCED set on its stack pages leads the page
daemon to reactivate these stack pages and delay their reclamation.

Reviewed by: kib
MFC after: 3 weeks


206545 13-Apr-2010 alc

Simplify vm_thread_swapin().


206483 11-Apr-2010 alc

Initialize the virtual memory-related resource limits in a single place.
Previously, one of these limits was initialized in two places to a
different value in each place. Moreover, because an unsigned int was used
to represent the amount of pageable physical memory, some of these limits
were incorrectly initialized on 64-bit architectures. (Currently, this
error is masked by login.conf's default settings.)

Make vm_thread_swapin() and vm_thread_swapout() static.

Submitted by: bde (an earlier version)
Reviewed by: kib
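
One way such a truncation can arise is shown below. The sketch assumes a
4 KB page size and hypothetical helper names; it is not the code that was
changed, only an instance of the same class of error.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KB pages */

/* Byte count computed in 32-bit arithmetic: wraps at 4 GB. */
static uint64_t sketch_limit_bytes_32(uint32_t npages)
{
    uint32_t bytes = npages * SKETCH_PAGE_SIZE;

    return bytes;
}

/* Same computation carried out in 64-bit arithmetic. */
static uint64_t sketch_limit_bytes_64(uint64_t npages)
{
    return npages * SKETCH_PAGE_SIZE;
}
```

With 2097152 pages (8 GB), the 32-bit version wraps while the 64-bit
version yields the full byte count.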


206409 09-Apr-2010 alc

Introduce the function kmem_alloc_attr(), which allocates kernel virtual
memory with the specified physical attributes. In particular, like
kmem_alloc_contig(), the caller can specify the physical address range
from which the physical pages are allocated and the memory attributes
(i.e., cache behavior) for these physical pages. However, in contrast to
kmem_alloc_contig() or contigmalloc(), the physical pages that are
allocated by kmem_alloc_attr() are not necessarily physically contiguous.
This function is needed by DRM and VirtualBox.

Correct an error in the prototype for kmem_malloc(). The third argument
had the wrong type.

Tested by: rnoland
MFC after: 3 days


206360 07-Apr-2010 joel

Start copyright notice with /*-


206264 06-Apr-2010 kib

When OOM searches for a process to kill, ignore the processes already
killed by OOM. When killed process waits for a page allocation, try to
satisfy the request as fast as possible.

This removes the often encountered deadlock, where OOM continuously
selects the same victim process, that sleeps uninterruptibly waiting
for a page. The killed process may still sleep if page cannot be
obtained immediately, but testing has shown that system has much
higher chance to survive in OOM situation with the patch.

In collaboration with: pho
Reviewed by: alc
MFC after: 4 weeks


206174 05-Apr-2010 alc

vm_reserv_alloc_page() should never be called on an OBJT_SG object, just as
it is never called on an OBJT_DEVICE object. (This change should have been
included in r195840.)

Reported by: dougb@, avg@
MFC after: 3 days


206142 03-Apr-2010 alc

Make _vm_map_init() the one place where the vm map's pmap field is
initialized.

Reviewed by: kib


206140 03-Apr-2010 alc

Re-enable the call to pmap_release() by vmspace_dofree(). The accounting
problem that is described in the comment has been addressed.

Submitted by: kib
Tested by: pho (a few months ago)
MFC after: 6 weeks


205536 23-Mar-2010 jhb

Reject attempts to create a MAP_ANON mapping with a non-zero offset.

PR: kern/71258
Submitted by: Alexander Best
MFC after: 2 weeks
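
The check can be pictured as a small validation step. The flag value, the
function name, and the choice of EINVAL as the error are assumptions for
illustration; the log entry only says that such requests are rejected.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAP_ANON 0x1000   /* hypothetical flag value */

/* Return 0 if the request is acceptable, or an errno value otherwise.
 * An anonymous mapping has no backing file, so a file offset is
 * meaningless and a non-zero one is treated as a caller error. */
static int sketch_check_anon_offset(int flags, int64_t offset)
{
    if ((flags & SKETCH_MAP_ANON) != 0 && offset != 0)
        return EINVAL;   /* assumed error code */
    return 0;
}
```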


205487 22-Mar-2010 kmacy

- enable alignment on amd64 only
- only align pcpu caches and the volatile portion of uma_zone


205298 18-Mar-2010 kmacy

turn 205266 into a no-op until the problem can be properly diagnosed


205266 17-Mar-2010 kmacy

Cache line align various structures and move volatile counters to
not share a cache line with (mostly) immutable state

Reviewed by: jeff@
MFC after: 7 days


204415 27-Feb-2010 kib

Update comment for vm_page_alloc(9), listing all acceptable flags [1].
Note that the function does not sleep, it can block.

Submitted by: Giovanni Trematerra <giovanni.trematerra gmail com> [1]
MFC after: 3 days


204205 22-Feb-2010 kib

Remove write-only variable.

MFC after: 3 days


204181 21-Feb-2010 alc

Align the start of the clean submap to a superpage boundary. Although
no superpage mappings are created within the clean submap, aligning the
start of the clean submap helps to prevent interference with kmem_alloc()'s
use of superpages.
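
Aligning a start address to a superpage boundary is a power-of-two round-up.
The sketch below assumes a 2 MB superpage and hypothetical helper names; the
actual size is machine-dependent.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SUPERPAGE_SIZE 0x200000ull   /* assumption: 2 MB superpages */

/* Round addr up to the next multiple of align (align must be a power of 2). */
static uint64_t sketch_align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* True if addr already sits on a multiple of align. */
static bool sketch_is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}
```

An already aligned address is returned unchanged, so applying the round-up
unconditionally is safe.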


203175 29-Jan-2010 kib

The MAP_ENTRY_NEEDS_COPY flag belongs to protoeflags, cow variable
uses different namespace.

Reported by: Jonathan Anderson <jonathan.anderson cl cam ac uk>
MFC after: 3 days


202529 17-Jan-2010 kib

When a vnode-backed vm object is referenced, it increments the vnode
reference count, and decrements it on dereference. If referenced object
is deallocated, object type is reset to OBJT_DEAD. Consequently, all
vnode references that are owned by object references are never released.
vunref() the vnode in vm object deallocation code for OBJT_VNODE
appropriate number of times to prevent leak.

Add an assertion to the vm_pageout() to make sure that we never get
reference on the vnode but then do not execute code to release it.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


201223 29-Dec-2009 rnoland

Update d_mmap() to accept vm_ooffset_t and vm_memattr_t.

This replaces d_mmap() with the d_mmap2() implementation and also
changes the type of offset to vm_ooffset_t.

Purge d_mmap2().

All driver modules will need to be rebuilt since D_VERSION is also
bumped.

Reviewed by: jhb@
MFC after: Not in this lifetime...


201145 28-Dec-2009 antoine

(S)LIST_HEAD_INITIALIZER takes a (S)LIST_HEAD as an argument.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.

PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
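
A correct use of the initializer, with the head itself as the argument, looks
roughly like this. The structure and function names are hypothetical, and the
sketch assumes a <sys/queue.h> that provides the BSD LIST macros.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element. */
struct sketch_item {
    int value;
    LIST_ENTRY(sketch_item) link;
};

/* The argument is the head being initialized, not the element type
 * and not the name of the entry field. */
static LIST_HEAD(sketch_head, sketch_item) sketch_list =
    LIST_HEAD_INITIALIZER(sketch_list);

/* Insert an item at the front of the list. */
static void sketch_list_add(struct sketch_item *item)
{
    LIST_INSERT_HEAD(&sketch_list, item, link);
}

/* Walk the list and add up the values. */
static int sketch_list_sum(void)
{
    struct sketch_item *it;
    int sum = 0;

    LIST_FOREACH(it, &sketch_list, link)
        sum += it->value;
    return sum;
}
```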


200770 21-Dec-2009 kib

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


200129 05-Dec-2009 antoine

Remove trailing ";" in UMA_HASH_INSERT and UMA_HASH_REMOVE macros.

MFC after: 1 month
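
Why a trailing ";" inside a statement-like macro is harmful can be shown
with a hypothetical macro (not the UMA one):

```c
/* Hypothetical statement-like macro, written without a trailing
 * semicolon so that the caller supplies it. */
#define SKETCH_COUNT_INSERT(counter) do { (counter)++; } while (0)

/* If the macro ended in "while (0);", the caller's own ';' would add an
 * empty statement after the if-branch, and the else below would no
 * longer attach to the if, which is a compile error. */
static int sketch_insert_if(int cond, int counter)
{
    if (cond)
        SKETCH_COUNT_INSERT(counter);
    else
        counter = -1;
    return counter;
}
```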


199870 28-Nov-2009 alc

Properly synchronize the previous change.


199869 27-Nov-2009 alc

Support the new VM_PROT_COPY option on wired pages. The effect of which
is that a debugger can now set a breakpoint in a program that uses mlock(2)
on its text segment or mlockall(2) on its entire address space.


199868 27-Nov-2009 alc

Simplify the invocation of vm_fault(). Specifically, eliminate the flag
VM_FAULT_DIRTY. The information provided by this flag can be trivially
inferred by vm_fault().

Discussed with: kib


199819 26-Nov-2009 alc

Replace VM_PROT_OVERRIDE_WRITE by VM_PROT_COPY. VM_PROT_OVERRIDE_WRITE has
represented a write access that is allowed to override write protection.
Until now, VM_PROT_OVERRIDE_WRITE has been used to write breakpoints into
text pages. Text pages are not just write protected but they are also
copy-on-write. VM_PROT_OVERRIDE_WRITE overrides the write protection on the
text page and triggers the replication of the page so that the breakpoint
will be written to a private copy. However, here is where things become
confused. It is the debugger, not the process being debugged that requires
write access to the copied page. Nonetheless, the copied page is being
mapped into the process with write access enabled. In other words, once the
debugger sets a breakpoint within a text page, the program can write to its
private copy of that text page. Whereas prior to setting the breakpoint, a
SIGSEGV would have occurred upon a write access. VM_PROT_COPY addresses
this problem. The combination of VM_PROT_READ and VM_PROT_COPY forces the
replication of a copy-on-write page even though the access is only for read.
Moreover, the replicated page is only mapped into the process with read
access, and not write access.

Reviewed by: kib
MFC after: 4 weeks


199490 18-Nov-2009 alc

Simplify both the invocation and the implementation of vm_fault() for wiring
pages.

(Note: Claims made in the comments about the handling of breakpoints in
wired pages have been false for roughly a decade. This and another bug
involving breakpoints will be fixed in coming changes.)

Reviewed by: kib


198870 04-Nov-2009 alc

Eliminate an unnecessary #include. (This #include should have been removed
in r188331 when vnode_pager_lock() was eliminated.)


198855 03-Nov-2009 alc

Eliminate a bit of hackery from vm_fault(). The operations that this
hackery sought to prevent are now properly supported by vm_map_protect().
(See r198505.)

Reviewed by: kib


198854 03-Nov-2009 attilio

Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvement aims to avoid further cache misses in scheduler
specific functions which need to keep track of average thread running
time, and further locking in places that set this flag.

Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>


198812 02-Nov-2009 alc

Avoid pointless calls to pmap_protect().

Reviewed by: kib


198811 02-Nov-2009 ivoras

Add sysctl documentation strings. The descriptions are derived
from tuning(7). One of the descriptions references tuning(7) because
it is too complex to adequately describe here (it is not a simple
boolean sysctl) and users should be warned of that.

Reviewed by: alc, kib
Approved by: gnn (mentor)


198721 31-Oct-2009 alc

Correct an error in vm_fault_copy_entry() that has existed since the first
version of this file. When a process forks, any wired pages are immediately
copied because copy-on-write is not supported for wired pages. In other
words, the child process is given its own private copy of each wired page
from its parent's address space. Unfortunately, to date, these copied pages
have been mapped into the child's address space with the wrong permissions,
typically VM_PROT_ALL. This change corrects the permissions.

Reviewed by: kib


198505 27-Oct-2009 kib

When protection of wired read-only mapping is changed to read-write,
install new shadow object behind the map entry and copy the pages
from the underlying objects to it. This makes the mprotect(2) call
actually perform the requested operation instead of silently doing
nothing and returning success, which causes SIGSEGV on a later write
access to the mapping.

Reuse vm_fault_copy_entry() to do the copying, modifying it to behave
correctly when src_entry == dst_entry.

Reviewed by: alc
MFC after: 3 weeks


198476 26-Oct-2009 alc

Simplify the inner loop of vm_fault_copy_entry().

Reviewed by: kib


198472 25-Oct-2009 alc

Eliminate an unnecessary check from vm_fault_prefault().


198341 21-Oct-2009 marcel

o Introduce vm_sync_icache() for making the I-cache coherent with
the memory or D-cache, depending on the semantics of the platform.
vm_sync_icache() is basically a wrapper around pmap_sync_icache(),
that translates the vm_map_t argument to pmap_t.
o Introduce pmap_sync_icache() to all PMAP implementation. For powerpc
it replaces the pmap_page_executable() function, added to solve
the I-cache problem in uiomove_fromphys().
o In proc_rwmem() call vm_sync_icache() when writing to a page that
has execute permissions. This assures that when breakpoints are
written, the I-cache will be coherent and the process will actually
hit the breakpoint.
o This also fixes the Book-E PMAP implementation that was missing
necessary locking while trying to deal with the I-cache coherency
in pmap_enter() (read: mmu_booke_enter_locked).

The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.


198201 18-Oct-2009 kib

Remove spurious call to priv_check(PRIV_VM_SWAP_NOQUOTA).
Call priv_check(PRIV_VM_SWAP_NORLIMIT) only when per-uid limit is
actually exceeded.

Both changes aim at calling priv_check(9) only for the cases when
privilege is actually exercised by the process.

Reported and tested by: rwatson
Reviewed by: alc
MFC after: 3 days


197750 04-Oct-2009 alc

Align and pad the page queue and free page queue locks so that the linker
can't possibly place them together within the same cache line.

MFC after: 3 weeks
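
Aligning and padding two locks so they cannot share a cache line can be
sketched with C11 alignas. The 64-byte line size and all sketch_* names are
assumptions; the real line size is machine-dependent.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* A lock word that starts on a cache-line boundary and is padded out
 * to fill the whole line, so nothing else can be placed beside it. */
struct sketch_padded_lock {
    alignas(SKETCH_CACHE_LINE) int lock_word;
    char pad[SKETCH_CACHE_LINE - sizeof(int)];
};

/* Two frequently contended locks, each on a line of its own. */
static struct sketch_padded_lock sketch_page_queue_lock;
static struct sketch_padded_lock sketch_free_queue_lock;

/* True if two addresses fall within the same cache line. */
static bool sketch_same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / SKETCH_CACHE_LINE ==
        (uintptr_t)b / SKETCH_CACHE_LINE;
}
```

Without the alignment and padding, the linker is free to place both lock
words next to each other, and contention on one would then invalidate the
line holding the other.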


197712 02-Oct-2009 bz

Back out the functional parts from r197537. After r197711, affecting all
user mappings, mmap no longer needs special treatment.


197661 01-Oct-2009 kib

Move the annotation for vm_map_startup() immediately before the function.

MFC after: 3 days


197537 27-Sep-2009 simon

Do not allow mmap with the MAP_FIXED argument to map at address zero.
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities. While this of course does not fix vulnerabilities,
it does mitigate their impact.

Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.

This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.

Discussed with: rwatson, bz
Tested by: bz (Wine), simon (VirtualBox)
Submitted by: jhb


197348 20-Sep-2009 kib

Old (a.out) rtld attempts to mmap zero-length region, e.g. when bss
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.

For a.out and pre-8 ELF binaries, allow the mmap of zero length.

Reported by: tegge
Reviewed by: tegge, alc, jhb
MFC after: 3 days


196730 01-Sep-2009 kib

Reintroduce the r196640, after fixing the problem with my testing.

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equal to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarios)
MFC after: 1 week


196648 29-Aug-2009 kib

Reverse r196640 and r196644 for now.


196640 29-Aug-2009 kib

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equal to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week


196637 29-Aug-2009 jhb

Mark the fake pages constructed by the OBJT_SG pager valid. This was
accidentally lost at one point during the PAT development. Without this
fix vm_pager_get_pages() was zeroing each of the pages.

Submitted by: czander @ NVidia
MFC after: 3 days


196615 28-Aug-2009 jhb

Extend the device pager to support different memory attributes on different
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
an additional in/out parameter that is the memory attribute to use for
the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
that the driver provides a d_mmap2() handler instead of d_mmap(). This
is done to make the change ABI compatible with existing drivers and
MFC'able to 7 and 8.

Submitted by: alc
MFC after: 1 month


195844 24-Jul-2009 jhb

Remove debugging that crept in with previous commit.

Reported by: nwhitehorn
Approved by: re (kib)


195840 24-Jul-2009 jhb

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


195774 19-Jul-2009 alc

Change the handling of fictitious pages by pmap_page_set_memattr() on
amd64 and i386. Essentially, fictitious pages provide a mechanism for
creating aliases for either normal or device-backed pages. Therefore,
pmap_page_set_memattr() on a fictitious page needn't update the direct
map or flush the cache. Such actions are the responsibility of the
"primary" instance of the page or the device driver that "owns" the
physical address. For example, these actions are already performed by
pmap_mapdev().

The device pager needn't restore the memory attributes on a fictitious
page before releasing it. It's now pointless.

Add pmap_page_set_memattr() to the Xen pmap.

Approved by: re (kib)


195749 18-Jul-2009 alc

An addendum to r195649, "Add support to the virtual memory system for
configuring machine-dependent memory attributes...":

Don't set the memory attribute for a "real" page that is allocated to
a device object in vm_page_alloc(). It is a pointless act, because
the device pager replaces this "real" page with a "fake" page and sets
the memory attribute on that "fake" page.

Eliminate pointless code from pmap_cache_bits() on amd64.

Employ the "Self Snoop" feature supported by some x86 processors to
avoid cache flushes in the pmap.

Approved by: re (kib)


195693 14-Jul-2009 jhb

- Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.

Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week
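
The zero-length behavior can be observed from user space with a small probe.
The function name is hypothetical, and the sketch assumes a system whose
<sys/mman.h> exposes MAP_ANON.

```c
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANON on glibc; harmless elsewhere */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Ask for a zero-length anonymous mapping.  Returns the errno value if
 * the request fails, or 0 if the system accepts it. */
static int sketch_mmap_zero_length(void)
{
    void *p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);

    if (p == MAP_FAILED)
        return errno;
    return 0;   /* a zero-length mapping leaves nothing to unmap */
}
```

On a system that follows the POSIX rule described above, the probe returns
EINVAL.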


195649 12-Jul-2009 alc

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


195635 12-Jul-2009 kib

When VM_MAP_WIRE_HOLESOK is not specified and vm_map_wire(9) encounters
non-readable and non-executable map entry, the entry is skipped from
wiring and the loop is aborted. But, since MAP_ENTRY_WIRE_SKIPPED was not
set for the map entry, its wired_count is later erroneously decremented.
vm_map_delete(9) for such a map entry then gets stuck in "vmmaps".

Properly set MAP_ENTRY_WIRE_SKIPPED when aborting the loop.

Reported by: John Marshall <john.marshall riverwillow com au>
Approved by: re (kensmith)


195329 03-Jul-2009 kib

When forking a vm space that has wired map entries, do not forget to
charge the objects created by vm_fault_copy_entry. The object charge
was set, but reserve not incremented.

Reported by: Greg Rivers <gcr+freebsd-current tharned org>
Reviewed by: alc (previous version)
Approved by: re (kensmith)


195131 28-Jun-2009 kib

Eliminate code duplication by calling vm_object_destroy()
from vm_object_collapse().

Requested and reviewed by: alc
Approved by: re (kensmith)


195033 26-Jun-2009 alc

This change is the next step in implementing the cache control functionality
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this change adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with: jhb


194990 25-Jun-2009 kib

Change the type of uio_resid member of struct uio from int to ssize_t.
Note that this does not actually enable full-range i/o requests for
64-bit architectures, and is done now to update KBI only.

Tested by: pho
Reviewed by: jhb, bde (as part of the review of the bigger patch)


194814 24-Jun-2009 kib

Initialize the uip to silence gcc warning that seems to sneak in in some
build environments.

Reported by: alc, bf1783 at googlemail com


194806 24-Jun-2009 alc

The bits set in a page's dirty mask are a subset of the bits set in its
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.

Eliminate an unnecessary #include.
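
The subset relationship described above can be illustrated with a small user-space sketch. The struct and function names below are hypothetical stand-ins, not the kernel's definitions, and the 8-bit masks assume a 4096-byte page tracked in 512-byte blocks:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a page's per-block valid and dirty masks. */
struct sk_page_bits {
	uint8_t valid;	/* one bit per block that holds valid data */
	uint8_t dirty;	/* one bit per block that has been modified */
};

/* Older style: intersect the two masks before using the result. */
static uint8_t
sk_dirty_and_valid_old(const struct sk_page_bits *p)
{
	return (p->dirty & p->valid);
}

/*
 * Newer style: rely on the invariant that every dirty bit is also a
 * valid bit, so the dirty mask alone already is the intersection.
 */
static uint8_t
sk_dirty_and_valid_new(const struct sk_page_bits *p)
{
	assert((p->dirty & ~p->valid) == 0);	/* dirty is a subset of valid */
	return (p->dirty);
}
```

As long as the invariant holds, both helpers return the same mask, which is why the bit-wise and is redundant.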


194766 23-Jun-2009 kib

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge the appropriate uid when allocation is performed by
the kernel, e.g. md(4).

Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


194642 22-Jun-2009 alc

Validate the page in one place, dev_pager_getpages(), rather than doing it
in two places, dev_pager_getfake() and dev_pager_updatefake().

Compare a pointer to "NULL" rather than "0".


194607 21-Jun-2009 alc

Implement a mechanism within vm_phys_alloc_contig() to defer all necessary
calls to vdrop() until after the free page queues lock is released. This
eliminates repeatedly releasing and reacquiring the free page queues lock
each time the last cached page is reclaimed from a vnode-backed object.


194562 21-Jun-2009 alc

Strive for greater consistency among the places that implement real,
fictitious, and contiguous page allocation. Eliminate unnecessary
reinitialization of a page's fields.


194459 18-Jun-2009 thompsa

Track the kernel mapping of a physical page by a new entry in vm_page
structure. When the page is shared, the kernel mapping becomes a special
type of managed page to force the cache off the page mappings. This is
needed to avoid stale entries on all ARM VIVT caches, and VIPT caches
with cache color issue.

Submitted by: Mark Tinguely
Reviewed by: alc
Tested by: Grzegorz Bernacki, thompsa


194429 18-Jun-2009 alc

Add support for UMA_SLAB_KERNEL to page_free(). (While I'm here remove an
unnecessary newline character from the end of two panic messages.)


194393 17-Jun-2009 alc

Eliminate unnecessary forward declarations.


194376 17-Jun-2009 alc

Refactor contigmalloc() into two functions: a simple front-end that deals
with the malloc tag and calls a new back-end, kmem_alloc_contig(), that
allocates the pages and maps them.

The motivations for this change are two-fold: (1) A cache mode parameter
will be added to kmem_alloc_contig(). In other words, kmem_alloc_contig()
will be extended to support the allocation of memory with caller-specified
caching. (2) The UMA allocation function that is used by the two jumbo
frames zones can use kmem_alloc_contig() in place of contigmalloc() and
thereby avoid having free jumbo frames held by the zone counted as live
malloc()ed memory.


194337 17-Jun-2009 alc

Pass the size of the mapping to contigmapping() as a "vm_size_t" rather
than a "vm_pindex_t". A "vm_size_t" is more convenient for it to use.


194331 17-Jun-2009 alc

Make the maintenance of a page's valid bits by contigmalloc() more like
kmem_alloc() and kmem_malloc(). Specifically, defer the setting of the
page's valid bits until contigmapping() when the mapping is known to be
successful.


194209 14-Jun-2009 alc

Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of an OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
an OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page and always an
insufficient number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistence. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping device pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITIOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with: jhb@


194126 13-Jun-2009 alc

Eliminate an unnecessary clearing of a page's dirty bits in
phys_pager_getpages().


193842 09-Jun-2009 alc

Eliminate an unnecessary restriction on the vm object type from
vm_map_pmap_enter(). The immediate effect of this change is that automatic
prefaulting by mmap() for small mappings is performed on POSIX shared memory
objects just the same as it is on ordinary files.


193643 07-Jun-2009 alc

Eliminate unnecessary obfuscation when testing a page's valid bits.


193594 06-Jun-2009 alc

Eliminate an unneeded forward declaration. (This should have been removed
in revision 1.42.)


193593 06-Jun-2009 alc

If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check
the page's valid bits. The page is guaranteed to be fully valid. (For the
record, this is documented in vm/vm_pager.h's comments.)


193522 05-Jun-2009 alc

vm_thread_swapin() needn't validate any pages. The pages are already
validated by vm_pager_get_pages().


193521 05-Jun-2009 alc

Simplify contigfree().


193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


193303 02-Jun-2009 alc

Correct a boundary case error in the management of a page's dirty bits by
shm_dotruncate() and vnode_pager_setsize(). Specifically, if the length of
a shared memory object or a file is truncated such that the length modulo
the page size is between 1 and 511, then all of the page's dirty bits were
cleared. Now, a dirty bit is cleared only if the corresponding block is
truncated in its entirety.
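
The corrected boundary rule can be sketched in user-space C. This is an illustration only, not the code of shm_dotruncate() or vnode_pager_setsize(); the helper name is hypothetical, and the constants assume a 4096-byte page divided into 512-byte blocks with one dirty bit each:

```c
#include <assert.h>
#include <stdint.h>

#define SK_BLOCK_SIZE	512u	/* assumed block size per dirty bit */
#define SK_PAGE_SIZE	4096u	/* assumed page size */

/*
 * Return the dirty mask of the last page after truncation, where "base"
 * is the new length modulo the page size (0 means the whole page lies
 * beyond the new end).  A bit is cleared only when its block starts at
 * or after "base", i.e. when the block is truncated in its entirety.
 */
static uint8_t
sk_dirty_after_truncate(uint8_t dirty, unsigned base)
{
	unsigned first_gone;
	uint8_t keep;

	assert(base <= SK_PAGE_SIZE);
	/* Round up: a partially truncated block keeps its dirty bit. */
	first_gone = (base + SK_BLOCK_SIZE - 1) / SK_BLOCK_SIZE;
	keep = (uint8_t)((1u << first_gone) - 1);
	return (dirty & keep);
}
```

With a remainder between 1 and 511, the first block is only partially truncated, so its dirty bit survives. Rounding down instead would select block 0 as the first block to clear and wipe the whole mask, which matches the boundary error described in the log message.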


193275 01-Jun-2009 jhb

Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
added to cdevsw. This function is called for each mmap() request.
If it returns ENODEV, then the mmap() request will fall back to using
the device's device pager object and d_mmap(). Otherwise, the method
can return a VM object to satisfy this entire mmap() request via
*object. It can also modify the starting offset into this object via
*foff. This allows device drivers to use the file offset as a cookie
to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
mapping V_CHR vnodes. This avoids duplicating all the cdev mmap
handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02. Older device drivers
using D_VERSION_01 are still supported.

MFC after: 1 month
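
The fallback protocol described for d_mmap_single can be mocked in user space as follows. Every type and function here is a hypothetical stand-in (the real callback also receives the cdev and uses kernel types); only the ENODEV fallback and the use of the file offset as a cookie are taken from the description above:

```c
#include <errno.h>
#include <stddef.h>

struct sk_object { int id; };		/* stand-in for a VM object */
typedef long long sk_off_t;		/* stand-in for the file offset type */

typedef int sk_mmap_single_t(sk_off_t *foff, size_t objsize,
    struct sk_object **object, int prot);

static struct sk_object sk_pager_object = { 0 };	/* "device pager" object */
static struct sk_object sk_cookie_objects[2] = { { 1 }, { 2 } };

/* Example driver method: offsets 0x1000 and 0x2000 select private objects. */
static int
sk_example_mmap_single(sk_off_t *foff, size_t objsize,
    struct sk_object **object, int prot)
{
	(void)objsize;
	(void)prot;
	if (*foff < 0x1000 || *foff >= 0x3000)
		return (ENODEV);	/* ask the caller to fall back */
	*object = &sk_cookie_objects[(*foff >> 12) - 1];
	*foff = 0;			/* map from the start of that object */
	return (0);
}

/* Caller side: try the per-request method first, then fall back. */
static struct sk_object *
sk_mmap_cdev(sk_mmap_single_t *single, sk_off_t *foff, size_t size,
    int prot, int *error)
{
	struct sk_object *obj = NULL;

	*error = single(foff, size, &obj, prot);
	if (*error == ENODEV) {
		*error = 0;
		return (&sk_pager_object);	/* device pager path */
	}
	return (*error == 0 ? obj : NULL);
}
```

Using the offset as a cookie lets a single device node hand out several distinct objects without any extra interface.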


193126 30-May-2009 alc

Eliminate a stale comment and the two remaining uses of the "register"
keyword in this file.


193124 30-May-2009 alc

Add assertions in two places where a page's valid or dirty bits are changed.


192968 28-May-2009 alc

Change vm_object_page_remove() such that it clears the page's dirty bits
when it invalidates the page.

Suggested by: tegge


192962 28-May-2009 alc

Revise vm_pageout_scan()'s handling of partially dirty pages. Specifically,
rather than unconditionally making partially dirty pages fully dirty, only
make partially dirty pages fully dirty if the pmap says that the page has
been modified.

(This change is also a small optimization. It eliminates an unnecessary call
to pmap_is_modified() on pages that are mapped read only.)

Suggested by: tegge


192360 19-May-2009 kmacy

- back out direct map hack
- it is no longer needed


192261 17-May-2009 alc

Eliminate a pointless call to pmap_clear_reference() from vm_pageout_scan().
If the page belongs to an object with a reference count of zero, then it
can't have any managed mappings on which to clear a reference bit.


192207 16-May-2009 kmacy

apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map


192134 15-May-2009 alc

Eliminate unnecessary clearing of the page's dirty mask from various
getpages functions.

Eliminate a stale comment.


192034 13-May-2009 alc

Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by: tegge


192010 12-May-2009 alc

Eliminate gratuitous clearing of the page's dirty mask.


191935 09-May-2009 alc

Fix a race involving vnode_pager_input_smlfs(). Specifically, in the case
that vnode_pager_input_smlfs() zeroes the page, it should not mark the page
as valid until after the page is zeroed. Otherwise, the page could be
mapped for read access (e.g., by vm_map_pmap_enter()) before the page is
zeroed. Reviewed by: tegge

Eliminate gratuitous clearing of the page's dirty mask by
vnode_pager_input_smlfs(). Instead, assert that the page is clean.
Reviewed by: tegge

Eliminate some blank lines.

Eliminate pointless calls to pmap_clear_modify() and vm_page_undirty() from
vnode_pager_input_old(). The page is not mapped. Therefore, it cannot have
any page table entries that are modified.

Eliminate an incorrect comment from vnode_pager_generic_getpages().


191874 07-May-2009 alc

Eliminate an incorrect comment.


191778 04-May-2009 alc

Eliminate vnode_pager_input_smlfs()'s pointless call to pmap_clear_modify().
The page can't possibly have any modified page table entries because it
isn't even mapped.


191626 28-Apr-2009 kib

Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace in a place where it was missed in r191277.

Noted by: pluknet gmail com


191625 28-Apr-2009 kib

Fix typo.


191543 26-Apr-2009 alc

Eliminate an errant comment.

Discussed with: tegge


191531 26-Apr-2009 alc

Eliminate an archaic band-aid. The immediately preceding comment already
explains why the band-aid is unnecessary.

Suggested by: tegge


191478 25-Apr-2009 alc

Eliminate unnecessary calls to pmap_clear_modify(). Specifically, calling
pmap_clear_modify() on a page is pointless if that page is not mapped or
it is only mapped for read access. Instead, assert that the page is not
mapped or not mapped for write access as appropriate.

Eliminate unnecessary clearing of a page's dirty mask. Instead, assert
that the page's dirty mask is clear.


191439 23-Apr-2009 kib

Do not call vm_page_lookup() from the ddb routine, namely from "show
vmopag" implementation. The vm_page_lookup() code modifies splay tree
of the object pages, and asserts that object lock is taken. First issue
could cause kernel data corruption, and second one instantly panics the
INVARIANTS-enabled kernel.

Take advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.

While there, make the code slightly more style-compliant by moving
variables declarations to the right place.

Discussed with: jhb, alc
Reviewed by: alc
MFC after: 2 weeks


191277 19-Apr-2009 kib

In both pageout oom handler and vm_daemon, acquire the reference to
the vmspace of the examined process instead of directly accessing its
vmspace, that may change. Also, as an optimization, check for P_INEXEC
flag before examining the process.

Reported and tested by: pho (previous version)
Reviewed by: alc
MFC after: 3 weeks


191263 19-Apr-2009 alc

Calling pmap_clear_modify() after calling pmap_remove_write() is pointless.
The latter function already clears the modified status from each of the
page's mappings.


191256 19-Apr-2009 alc

Allow valid pages to be mapped for read access when they have a non-zero
busy count. Only mappings that allow write access should be prevented by
a non-zero busy count.

(The prohibition on mapping pages for read access when they have a non-
zero busy count originated in revision 1.202 of i386/i386/pmap.c when
this code was a part of the pmap.)

Reviewed by: tegge


190949 11-Apr-2009 alc

Remove execute permission from the memory allocated by sbrk().

Pre-announced on: -arch (3/31/09)
Discussed with: rwatson
Tested by: marius (sparc64)


190912 11-Apr-2009 alc

Previously, when vm_page_free_toq() was performed on a page belonging to
a reservation, unless all of the reservation's pages were free, the
reservation was moved to the head of the partially-populated reservations
queue, where it would be the next reservation to be broken in case the
free page queues were emptied. Now, instead, I am moving it to the tail.
Very likely this reservation is in the process of being freed in its
entirety, so placing it at the tail of the queue makes it more likely that
the underlying physical memory will be returned to the free page queues as
one contiguous chunk. If a reservation must be broken, it will, instead,
be the longest unchanged reservation, which is arguably the reservation
that is least likely to ever achieve promotion or be freed in its entirety.

MFC after: 6 weeks


190886 10-Apr-2009 kib

When vm_map_wire(9) is allowed to skip holes in the wired region, skip
the mappings without any of read and execution rights, in particular,
the PROT_NONE entries. This makes mlockall(2) work for the process
address space that has such mappings.

Since protection mode of the entry may change between setting
MAP_ENTRY_IN_TRANSITION and final pass over the region that records
the wire status of the entries, allocate new map entry flag
MAP_ENTRY_WIRE_SKIPPED to mark the skipped PROT_NONE entries.

Reported and tested by: Hans Ottevanger <fbsdhackers beasties demon nl>
Reviewed by: alc
MFC after: 3 weeks


190705 04-Apr-2009 alc

Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission. On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission. The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE. (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC. In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list. So,
the need for coalescing vm map entries is not as great as it once was.


190604 01-Apr-2009 alc

Eliminate dead code.

Reviewed by: jhb


189595 09-Mar-2009 jhb

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.

MFC after: 1 month


189024 25-Feb-2009 alc

Prior to r188331 a map entry's last read offset was only updated by a hard
fault. In r188331 this update was relocated because of synchronization
changes to a place where it would occur on both hard and soft faults. This
change again restricts the update to hard faults.


189015 24-Feb-2009 kib

Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by: pho
Reviewed by: alc
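
The deferred-freelist pattern can be modelled in a few lines of user-space C. This is only a sketch of the idea; the struct layouts and helper names are hypothetical, and "freeing" is reduced to setting a flag:

```c
#include <assert.h>
#include <stddef.h>

struct sk_map_entry {
	struct sk_map_entry *next;
	int freed;
};

struct sk_map {
	int locked;
	struct sk_map_entry *deferred_freelist;
};

static void
sk_map_lock(struct sk_map *m)
{
	assert(!m->locked);
	m->locked = 1;
}

/* Called with the map locked: queue the entry instead of freeing it. */
static void
sk_map_delete_entry(struct sk_map *m, struct sk_map_entry *e)
{
	assert(m->locked);
	e->next = m->deferred_freelist;
	m->deferred_freelist = e;
}

/* Unlock, then purge the queued entries; returns how many were freed. */
static int
sk_map_unlock(struct sk_map *m)
{
	struct sk_map_entry *e, *next;
	int n = 0;

	assert(m->locked);
	e = m->deferred_freelist;
	m->deferred_freelist = NULL;
	m->locked = 0;
	for (; e != NULL; e = next) {
		next = e->next;
		e->freed = 1;	/* real code would release the object here */
		n++;
	}
	return (n);
}
```

Because the purge is tied to the unlock operation, callers of the delete routine no longer need to carry a freelist argument around.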


189014 24-Feb-2009 kib

Add the assertion macros for the map locks. Use them in several map
manipulation functions.

Tested by: pho
Reviewed by: alc


189012 24-Feb-2009 kib

Update the comment after the r188334.

Reviewed by: alc


189004 24-Feb-2009 rdivacky

Change the functions to ANSI in those cases where it breaks promotion
to int rule. See ISO C Standard: SS6.7.5.3:15.

Approved by: kib (mentor)
Reviewed by: warner
Tested by: silence on -current


188967 23-Feb-2009 rwatson

Put debug.vm_lowmem sysctl under DIAGNOSTIC.

Submitted by: sam
MFC after: 3 days


188964 23-Feb-2009 rwatson

Add a debugging sysctl, debug.vm_lowmem, that when assigned a value of
1 will trigger a pass through the VM's low-memory handlers, such as
protocol and UMA drain routines. This makes it easier to exercise
these otherwise rarely-invoked code paths.

MFC after: 3 days


188900 21-Feb-2009 alc

Reduce the scope of the page queues lock in vm_object_page_remove().

MFC after: 1 week


188859 20-Feb-2009 alc

Eliminate stale comments.


188386 09-Feb-2009 kib

Comment out the assertion from r188321. It is not valid for nfs.

Reported by: alc


188383 09-Feb-2009 alc

Avoid some cases of unnecessary page queues locking by vm_fault's delete-
behind heuristic.


188348 08-Feb-2009 alc

Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by: kib


188337 08-Feb-2009 kib

Remove no longer valid comment.

Submitted by: alc


188335 08-Feb-2009 kib

Improve comments, correct English.

Submitted by: alc


188334 08-Feb-2009 kib

Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by: tegge, alc
Tested by: pho


188333 08-Feb-2009 kib

In vm_map_sync(), do not call vm_object_sync() while holding map lock.
Reference object, drop the map lock, and then call vm_object_sync().
The object sync might require vnode lock for OBJT_VNODE type objects.

Reviewed by: tegge
Tested by: pho


188331 08-Feb-2009 kib

Do not sleep for vnode lock while holding map lock in vm_fault. Try to
acquire vnode lock for OBJT_VNODE object after map lock is dropped.
Because we have the busy page(s) in the object, sleeping there would
result in deadlock with vnode resize. Try to get lock without sleeping,
and, if the attempt failed, drop the state, lock the vnode, and restart
the fault handler from the start with already locked vnode.

Because the vnode_pager_lock() function is inlined in vm_fault(),
axe it.

Based on suggestion by: alc
Reviewed by: tegge, alc
Tested by: pho


188325 08-Feb-2009 kib

Add the comments to vm_map_simplify_entry() and vmspace_fork(),
describing why several calls to vm_object_deallocate() with locked map
do not result in the acquisition of the vnode lock after map lock.

Suggested and reviewed by: tegge


188323 08-Feb-2009 kib

Lock the new map in vmspace_fork(). The newly allocated map should not
be accessible outside vmspace_fork() yet, but locking it would satisfy
the protocol of the vm_map_entry_link() and other functions called
from vmspace_fork().

Use trylock, which supposedly cannot fail, to silence WITNESS warning
of the nested acquisition of the sx lock with the same name.

Suggested and reviewed by: tegge


188321 08-Feb-2009 kib

Assert that vnode is exclusively locked when its vm object is resized.

Reviewed by: tegge


188320 08-Feb-2009 kib

Do not leak the MAP_ENTRY_IN_TRANSITION flag when copying map entry
on fork. Otherwise, copied entry cannot be removed in the child map.

Reviewed by: tegge
MFC after: 2 weeks


188319 08-Feb-2009 kib

Style.


187681 25-Jan-2009 jeff

- Make the keg abstraction more complete. Permit a zone to have multiple
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.

Sponsored by: Nokia


187658 23-Jan-2009 jhb

- Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.


187527 21-Jan-2009 jhb

Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by: ups


186719 03-Jan-2009 kib

Extend the struct vm_page wire_count to u_int to avoid the overflow
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].

To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]

Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks
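
The overflow that motivates widening a reference-style counter can be shown with a short sketch. It assumes, for illustration only, that the narrow counter is a 16-bit unsigned short; the helper names are hypothetical:

```c
#include <assert.h>
#include <limits.h>

/* A narrow counter silently wraps to zero once it passes its maximum. */
static unsigned short
sk_bump_narrow(unsigned short count)
{
	return ((unsigned short)(count + 1));
}

/* A wider counter keeps counting past the narrow type's maximum. */
static unsigned int
sk_bump_wide(unsigned int count)
{
	return (count + 1);
}

/*
 * Safety check in the spirit of "detect overflow and fall back":
 * returns nonzero if one more reference would wrap the narrow counter.
 */
static int
sk_narrow_would_overflow(unsigned short count)
{
	return (count == USHRT_MAX);
}
```

A caller that sees sk_narrow_would_overflow() return nonzero can take a slower path instead of incrementing, which is the kind of fallback the message describes for the cow field.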


186665 01-Jan-2009 alc

Resurrect shared map locks allowing greater concurrency during some map
operations, such as page faults.

An earlier version of this change was ...

Reviewed by: kib
Tested by: pho
MFC after: 6 weeks


186633 31-Dec-2008 alc

Update or eliminate some stale comments.


186618 30-Dec-2008 alc

Avoid an unnecessary memory dereference in vm_map_entry_splay().


186616 30-Dec-2008 alc

Style change to vm_map_lookup(): Eliminate a macro of dubious value.


186609 30-Dec-2008 alc

Move the implementation of the vm map's fast path on address lookup from
vm_map_lookup{,_locked}() to vm_map_lookup_entry(). Having the fast path
in vm_map_lookup{,_locked}() limits its benefits to page faults. Moving
it to vm_map_lookup_entry() extends its benefits to other operations on
the vm map.


186374 21-Dec-2008 rnoland

Fix printing of KASSERT message missed in r163604.

Approved by: kib


185012 16-Nov-2008 kib

Instead of forcing vn_start_write() to reset mp back to NULL for the
failed calls with non-NULL vp, explicitly clear mp after failure.

Tested by: stass
Reviewed by: tegge
PR: 123768
MFC after: 1 week


184728 06-Nov-2008 raj

Support kernel crash mini dumps on ARM architecture.

Obtained from: Juniper Networks, Semihalf


184546 02-Nov-2008 keramida

Various comment nits, and typos.


184168 22-Oct-2008 rwatson

Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after: 3 days


183754 10-Oct-2008 attilio

Remove the unused struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) as just temporary, as it will be axed ASAP.

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


183474 29-Sep-2008 kib

Move the code for doing out-of-memory handling from vm_pageout_scan()
into the separate function vm_pageout_oom(). Supply a parameter for
vm_pageout_oom() describing a reason for the call.

Call vm_pageout_oom() from the swp_pager_meta_build() when swap zone
is exhausted.

Reviewed by: alc
Tested by: pho, jhb
MFC after: 2 weeks


183389 26-Sep-2008 emaste

Move CTASSERT from header file to source file, per implementation note now
in the CTASSERT man page.


183383 26-Sep-2008 kib

Save previous content of the td_fpop before storing the current
filedescriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().

Change guards against td_fpop field being non-NULL with private state
for another device, and against sudden clearing the td_fpop. This
could occur when either a driver method calls another driver through
the filedescriptor operation, or a page fault happens while a driver is
writing to a memory backed by another driver.

Noted by: rwatson
Tested by: rnoland
MFC after: 3 days


183236 21-Sep-2008 alc

Prevent an integer overflow in vm_pageout_page_stats() on machines with a
large number of physical pages.

PR: 126158
Submitted by: Dmitry Tejblum
MFC after: 3 days


183216 20-Sep-2008 kib

Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with: rwatson, jhb, rnoland
Tested by: rnoland
MFC after: 3 days


182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and thus useless.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


182047 23-Aug-2008 antoine

Remove unused variable nosleepwithlocks.

PR: 126609
Submitted by: Mateusz Guzik
MFC after: 1 month
X-MFC: to stable/7 only, this variable is still used in stable/6


182028 23-Aug-2008 nwhitehorn

Allow the MD UMA allocator to use VM routines like kmem_*(). Existing code
requires MD allocator to be available early in the boot process, before the
VM is fully available. This defines a new VM define
(UMA_MD_SMALL_ALLOC_NEEDS_VM) that allows an MD UMA small allocator to
become available at the same time as the default UMA allocator.

Approved by: marcel (mentor)


181887 20-Aug-2008 julian

A bunch of formatting fixes brought to light by, or created by the Vimage commit
a few days ago.


181811 17-Aug-2008 kmacy

Work around differences in page allocation for initial page tables on xen

MFC after: 1 month


181693 13-Aug-2008 emaste

Fix REDZONE(9) on amd64 and perhaps other 64 bit targets -- ensure the space
that redzone adds to the allocation for storing its metadata is at least as
large as the metadata that it will store there.

Submitted by: Nima Misaghian


181334 05-Aug-2008 jhb

If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been released. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks
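
The "return a flag, let the caller wake the swapper" pattern can be sketched as a user-space model. All names are hypothetical; the lock is reduced to a flag, and the sketch assumes the caller performs the wakeup once it no longer holds the sleep queue lock:

```c
#include <assert.h>

struct sk_thread {
	int swapped_out;
	int runnable;
};

static int sk_sleepq_locked;	/* models holding the sleep queue lock */
static int sk_swapper_wakeups;	/* counts wakeups delivered to "proc0" */

/* Make the thread runnable; report whether the swapper must be woken. */
static int
sk_setrunnable(struct sk_thread *td)
{
	td->runnable = 1;
	return (td->swapped_out);
}

/* Must not be called with the sleep queue lock held. */
static void
sk_kick_proc0(void)
{
	assert(!sk_sleepq_locked);
	sk_swapper_wakeups++;
}

/* Consumer of the API: collect the flag under the lock, act on it after. */
static void
sk_wakeup_all(struct sk_thread *tds, int n)
{
	int i, wakeup_swapper = 0;

	sk_sleepq_locked = 1;
	for (i = 0; i < n; i++)
		wakeup_swapper |= sk_setrunnable(&tds[i]);
	sk_sleepq_locked = 0;
	if (wakeup_swapper)
		sk_kick_proc0();
}
```

Returning a boolean moves the swapper wakeup out of the locked region, so no sleep queue lock is held when a second one might be needed.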


181239 03-Aug-2008 trhodes

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


181024 30-Jul-2008 jhb

One more whitespace nit.


181020 30-Jul-2008 jhb

A few more whitespace fixes.


181019 30-Jul-2008 jhb

If the kernel has run out of metadata for swap, then explicitly panic()
instead of emitting a warning before deadlocking.

MFC after: 1 month


181004 30-Jul-2008 kib

The behaviour of the lockmgr going back at least to the 4.4BSD-Lite2 was
to downgrade the exclusive lock to shared one when exclusive lock owner
requested shared lock. New lockmgr panics instead.

The vnode_pager_lock function requests shared lock on the vnode backing
the OBJT_VNODE, and can be called when the current thread already holds
an exclusive lock on the vnode. For instance, it happens when handling
page fault from the VOP_WRITE() uiomove that writes to the file, with
the faulted in page fetched from the vm object backed by the same file.
We then get the situation described above.

Verify whether the vnode is already exclusively locked by the curthread
and request recursed exclusive vnode lock instead of shared, if true.

Reported by: gallatin
Discussed with: attilio
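
The check described in r181004 amounts to choosing the lock request based on who already owns the vnode lock. The sketch below is a hypothetical model: the structure, the enum, and the thread identifiers are invented for illustration and do not correspond to the real lockmgr or vnode interfaces.

```c
/* Hypothetical lock requests; the kernel uses lockmgr flags instead. */
enum model_lock_request {
	MODEL_LOCK_SHARED,
	MODEL_LOCK_EXCLUSIVE_RECURSE
};

struct model_vnode {
	int excl_owner;		/* id of the exclusive owner, or -1 if none */
};

/*
 * Pick the request the pager should issue.  If the current thread
 * already holds the vnode exclusively, ask for a recursive exclusive
 * lock; a shared request would hit the new lockmgr panic.
 */
static enum model_lock_request
model_pager_lock_type(const struct model_vnode *vp, int curthread_id)
{
	if (vp->excl_owner == curthread_id)
		return (MODEL_LOCK_EXCLUSIVE_RECURSE);
	return (MODEL_LOCK_SHARED);
}
```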


180598 18-Jul-2008 alc

Eliminate stale comments from kmem_malloc().


180446 11-Jul-2008 kib

Use VM_ALLOC_INTERRUPT for the page requests when allocating memory
for the bio for swapout write. It allows the page allocator to drain
the free page list deeper. As a result, a deadlock where the pageout daemon
sleeps waiting for a bio to be allocated for swapout is no longer
reproducible in practice.

Alan said that M_USE_RESERVE shall be resurrected and used there, but
until this is implemented, M_NOWAIT does exactly what is needed.

Tested by: pho, kris
Reviewed by: alc
No objections from: phk
MFC after: 2 weeks (RELENG_7 only)


180308 05-Jul-2008 alc

Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout


179923 22-Jun-2008 alc

Make preparations for increasing the size of the kernel virtual address space
on the amd64 architecture. The amd64 architecture requires kernel code and
global variables to reside in the highest 2GB of the 64-bit virtual address
space. Thus, the memory allocated during bootstrap, before the call to
kmem_init(), starts at KERNBASE, which is not necessarily the same as
VM_MIN_KERNEL_ADDRESS on amd64.


179921 21-Jun-2008 alc

KERNBASE is not necessarily an address within the kernel map, e.g.,
PowerPC/AIM. Consequently, it should not be used to determine the maximum
number of kernel map entries. Instead, use VM_MIN_KERNEL_ADDRESS, which marks
the start of the kernel map on all architectures.

Tested by: marcel@ (PowerPC/AIM)


179765 12-Jun-2008 ups

Fix vm object creation locking to allow SHARED vnode locking for vnode_create_vobject.
(Not currently used)

Noticed by: kib@


179623 06-Jun-2008 alc

Essentially, neither madvise(..., MADV_DONTNEED) nor madvise(..., MADV_FREE)
work. (Moreover, I don't believe that they have ever worked as intended.)
The explanation is fairly simple. Both MADV_DONTNEED and MADV_FREE perform
vm_page_dontneed() on each page within the range given to madvise(). This
function moves the page to the inactive queue. Specifically, if the page is
clean, it is moved to the head of the inactive queue where it is first in
line for processing by the page daemon. On the other hand, if it is dirty,
it is placed at the tail. Let's further examine the case in which the page
is clean. Recall that the page is at the head of the line for processing by
the page daemon. The expectation of vm_page_dontneed()'s author was that
the page would be transferred from the inactive queue to the cache queue by
the page daemon. (Once the page is in the cache queue, it is, in effect,
free, that is, it can be reallocated to a new vm object by vm_page_alloc()
if it isn't reactivated quickly enough by a user of the old vm object.) The
trouble is that nowhere in the execution of either MADV_DONTNEED or
MADV_FREE is either the machine-independent reference flag (PG_REFERENCED)
or the reference bit in any page table entry (PTE) mapping the page cleared.
Consequently, the immediate reaction of the page daemon is to reactivate the
page because it is referenced. In effect, the madvise() was for naught.
The case in which the page was dirty is not too different. Instead of being
laundered, the page is reactivated.

Note: The essential difference between MADV_DONTNEED and MADV_FREE is
that MADV_FREE clears a page's dirty field. So, MADV_FREE is always
executing the clean case above.

This revision changes vm_page_dontneed() to clear both the machine-
independent reference flag (PG_REFERENCED) and the reference bit in all PTEs
mapping the page.

MFC after: 6 weeks
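
The effect described in r179623 can be reproduced with a small model of the inactive queue scan. Everything here is a simplification under stated assumptions: a single `referenced` flag stands in for both PG_REFERENCED and the PTE reference bits, the `model_*` names are hypothetical, and laundering of dirty pages is not modeled.

```c
#include <stdbool.h>

enum model_queue { MODEL_ACTIVE, MODEL_INACTIVE, MODEL_CACHE };

struct model_page {
	enum model_queue queue;
	bool referenced;	/* PG_REFERENCED and PTE reference bits */
	bool dirty;
};

/*
 * Model of vm_page_dontneed().  With clear_reference false this is the
 * old behaviour; with true it is the behaviour after the revision.
 */
static void
model_page_dontneed(struct model_page *m, bool clear_reference)
{
	if (clear_reference)
		m->referenced = false;
	m->queue = MODEL_INACTIVE;
}

/* One visit by the page daemon to a page on the inactive queue. */
static void
model_pagedaemon_visit(struct model_page *m)
{
	if (m->queue != MODEL_INACTIVE)
		return;
	if (m->referenced) {
		/* A referenced page is reactivated, undoing the advice. */
		m->referenced = false;
		m->queue = MODEL_ACTIVE;
	} else if (!m->dirty) {
		/* A clean, unreferenced page becomes reclaimable. */
		m->queue = MODEL_CACHE;
	}
}
```

With the reference left set, the first visit sends the page straight back to the active queue, which is the "for naught" outcome the commit message describes.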


179296 24-May-2008 alc

To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped. Specifically, it has
returned EINVAL if the entire range is not mapped. There is not,
however, any basis for this in either SUSv2 or our own man page.
Moreover, neither Linux nor Solaris imposes this requirement. This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks


179159 20-May-2008 ups

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


179081 18-May-2008 alc

Retire pmap_addr_hint(). It is no longer used.


179076 17-May-2008 alc

In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping. Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address. Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address. If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated. In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().


179074 17-May-2008 alc

Preset a device object's alignment ("pg_color") based upon the
physical address of the device's memory. This enables
pmap_align_superpage() to propose a virtual address for mapping the
device memory that permits the use of superpage mappings.


179019 15-May-2008 alc

Don't call vm_reserv_alloc_page() on device-backed objects. Otherwise, the
system may panic because there is no reservation structure corresponding to
the physical address of the device memory.

Reported by: Giorgos Keramidas


178935 10-May-2008 alc

Provide the new argument to kmem_suballoc().


178933 10-May-2008 alc

Introduce a new parameter "superpage_align" to kmem_suballoc() that is
used to request superpage alignment for the submap.

Request superpage alignment for the kmem_map.

Pass VMFS_ANY_SPACE instead of TRUE to vm_map_find(). (They are currently
equivalent but VMFS_ANY_SPACE is the new preferred spelling.)

Remove a stale comment from kmem_malloc().


178928 10-May-2008 alc

Generalize vm_map_find(9)'s parameter "find_space". Specifically, add
support for VMFS_ALIGNED_SPACE, which requests the allocation of an
address range best suited to superpages. The old options TRUE and FALSE
are mapped to VMFS_ANY_SPACE and VMFS_NO_SPACE, so that there is no
immediate need to update all of vm_map_find(9)'s callers.

While I'm here, correct a misstatement about vm_map_find(9)'s return
values in the man page.


178875 09-May-2008 alc

Introduce pmap_align_superpage(). It increases the starting virtual
address of the given mapping if a different alignment might result in more
superpage mappings.
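
The address adjustment described in r178875 can be sketched as plain arithmetic. This is a simplified model, not the pmap code: it assumes a 2MB superpage, the `model_*` names are hypothetical, and it omits the check the real function would need for whether a whole superpage still fits after alignment.

```c
#include <stdint.h>

#define MODEL_SUPERPAGE_SIZE	((uint64_t)2 * 1024 * 1024)	/* assumed */
#define MODEL_SUPERPAGE_MASK	(MODEL_SUPERPAGE_SIZE - 1)

/*
 * Return the lowest address >= addr whose offset within a superpage
 * matches the offset of the backing memory within its superpage.
 * Only then can virtual and physical superpage boundaries line up.
 */
static uint64_t
model_align_superpage(uint64_t addr, uint64_t backing_offset, uint64_t size)
{
	uint64_t want = backing_offset & MODEL_SUPERPAGE_MASK;

	if (size < MODEL_SUPERPAGE_SIZE)	/* too small to benefit */
		return (addr);
	if ((addr & MODEL_SUPERPAGE_MASK) <= want)
		return ((addr & ~MODEL_SUPERPAGE_MASK) + want);
	/* Already past the wanted offset: move to the next superpage. */
	return (((addr + MODEL_SUPERPAGE_MASK) & ~MODEL_SUPERPAGE_MASK) +
	    want);
}
```

The address only ever increases, matching the commit's statement that the function "increases the starting virtual address".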


178792 05-May-2008 kmacy

add malloc flag to blist so that it can be used in ithread context

Reviewed by: alc, bsdimp


178637 28-Apr-2008 alc

Eliminate pointless casts from kmem_suballoc().


178630 28-Apr-2008 alc

vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.


178272 17-Apr-2008 jeff

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


177956 06-Apr-2008 alc

Introduce vm_reserv_reclaim_contig(). This function is used by
contigmalloc(9) as a last resort to steal pages from an inactive,
partially-used superpage reservation.

Rename vm_reserv_reclaim() to vm_reserv_reclaim_inactive() and
refactor it so that a separate subroutine is responsible for breaking
the selected reservation. This subroutine is also used by
vm_reserv_reclaim_contig().


177932 05-Apr-2008 alc

Eliminate an unnecessary test from vm_phys_unfree_page().


177922 04-Apr-2008 alc

Update a comment to vm_map_pmap_enter().


177921 04-Apr-2008 alc

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


177762 30-Mar-2008 alc

Eliminate an unnecessary printf() from kmem_suballoc(). The subsequent
panic() can be extended to convey the same information.


177704 29-Mar-2008 jeff

- Use vm_object_reference_locked() directly from
vm_object_reference(). This is intended to get rid of vget()
consumers who don't wish to acquire a lock. This is functionally
the same as calling vref(). vm_object_reference_locked() already
uses vref.

Discussed with: alc


177458 20-Mar-2008 kib

Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.

PR: kern/119422
MFC after: 1 week


177414 19-Mar-2008 alc

Rename vm_pageq_requeue() to vm_page_requeue() on account of its recent
migration to vm/vm_page.c.


177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


177342 18-Mar-2008 alc

Almost seven years ago, vm/vm_page.c was split into three parts:
vm/vm_contig.c, vm/vm_page.c, and vm/vm_pageq.c. Today, vm/vm_pageq.c
has withered to the point that it contains only four short functions,
two of which are only used by vm/vm_page.c. Since I can't foresee any
reason for vm/vm_pageq.c to grow, it is time to fold the remaining
contents of vm/vm_pageq.c back into vm/vm_page.c.

Add some comments. Rename one of the functions, vm_pageq_enqueue(),
that is now static within vm/vm_page.c to vm_page_enqueue().
Eliminate PQ_MAXCOUNT as it no longer serves any purpose.


177261 16-Mar-2008 alc

Simplify the inner loop of vm_fault()'s delete-behind heuristic.
Instead of checking each page for PG_UNMANAGED, perform a one-time
check whether the object is OBJT_PHYS. (PG_UNMANAGED pages only
belong to OBJT_PHYS objects.)


177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


177085 12-Mar-2008 jeff

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


176967 09-Mar-2008 alc

Eliminate an unnecessary test from vm_fault's delete-behind heuristic.
Specifically, since the delete-behind heuristic is never applied to a
device-backed object, there is no point in checking whether each of the
object's pages is fictitious. (Only device-backed objects have
fictitious pages.)


176717 01-Mar-2008 marcel

Make the vm_pmap field of struct vmspace the last field in the
structure. This allows per-CPU variations of struct pmap on a
single architecture without affecting the machine-independent
fields. As such, the PMAP variations don't affect the ABI. They
become part of it.


176596 26-Feb-2008 alc

Correct a long-standing error in vm_object_page_remove(). Specifically,
pmap_remove_all() must not be called on fictitious pages. To date,
fictitious pages have been allocated from zeroed memory, effectively
hiding this problem because the fictitious pages appear to have an empty
pv list. Submitted by: Kostik Belousov

Rewrite the comments describing vm_object_page_remove() to better
describe what it does. Add an assertion. Reviewed by: Kostik Belousov

MFC after: 1 week


176526 24-Feb-2008 alc

Correct a long-standing error in vm_object_deallocate(). Specifically,
only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should
ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was
setting it on device (OBJT_DEVICE) objects. As a result,
vm_object_page_remove() could be called on a device object and if that
occurred pmap_remove_all() would be called on the device object's pages.
However, a device object's pages are fictitious, and fictitious pages do
not have an initialized pv list (struct md_page).

To date, fictitious pages have been allocated from zeroed memory,
effectively hiding this problem. Now, however, the conversion of rotting
diagnostics to invariants in the amd64 and i386 pmaps has revealed the
problem. Specifically, assertion failures have occurred during the
initialization phase of the X server on some hardware.

MFC after: 1 week
Discussed with: Kostik Belousov
Reported by: Michiel Boland


175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument that is always curthread.
Remove the unnecessary extra argument and pass curthread explicitly to
lower layer functions, when necessary.

This change breaks the KPI, which should affect several ports, so a
version bump and manpage update will be committed separately.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


175210 10-Jan-2008 pjd

When one tries to allocate memory with the M_WAITOK flag and we are short in
address space in kmem map call vm_lowmem event in a loop and wait a bit for
subsystems to reclaim some memory which in turn will reclaim address space as
well.

Note, this is a work-around.

Reviewed by: alc
Approved by: alc
MFC after: 3 days
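
The work-around in r175210 is a bounded retry loop around the address space search. The sketch below models it with simple counters. The retry bound, the structure, and all `model_*` names are assumptions made for illustration, and the short sleep between attempts is only indicated by a comment.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_RETRIES	8	/* assumed bound, for illustration */

struct model_kva {
	size_t free_bytes;	/* address space currently available */
	size_t reclaimable;	/* address space consumers could give back */
};

/* Try to carve size bytes out of the map. */
static bool
model_findspace(struct model_kva *kva, size_t size)
{
	if (kva->free_bytes < size)
		return (false);
	kva->free_bytes -= size;
	return (true);
}

/* Stand-in for the vm_lowmem event: consumers release some memory. */
static void
model_lowmem_event(struct model_kva *kva, size_t chunk)
{
	size_t give = kva->reclaimable < chunk ? kva->reclaimable : chunk;

	kva->reclaimable -= give;
	kva->free_bytes += give;
}

/* A waiting allocation: raise the event and retry a few times. */
static bool
model_alloc_waitok(struct model_kva *kva, size_t size)
{
	int i;

	if (model_findspace(kva, size))
		return (true);
	for (i = 0; i < MODEL_MAX_RETRIES; i++) {
		model_lowmem_event(kva, size);
		/* The kernel would also sleep briefly at this point. */
		if (model_findspace(kva, size))
			return (true);
	}
	return (false);
}
```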


175202 10-Jan-2008 attilio

vn_lock() is currently only used with 'curthread' passed as argument.
Remove this argument and pass curthread directly to the underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI is, obviously, changed as a result.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, it is worth mentioning that upcoming commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


175164 08-Jan-2008 jhb

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors is garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
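
The namespace lookup described in r175164 hashes the pathname with a 32-bit Fowler/Noll/Vo hash. The sketch below shows the FNV-1 variant with the widely published offset basis and prime; the kernel's own constants and table size may differ, so the `MODEL_*` values and the bucket count are assumptions made for illustration.

```c
#include <stdint.h>

#define MODEL_FNV1_32_INIT	2166136261u	/* published offset basis */
#define MODEL_FNV_32_PRIME	16777619u	/* published 32-bit prime */
#define MODEL_SHM_BUCKETS	64u		/* hypothetical, power of 2 */

/* FNV-1: multiply by the prime, then XOR in the next byte. */
static uint32_t
model_fnv1_32_str(const char *path)
{
	uint32_t hash = MODEL_FNV1_32_INIT;

	for (; *path != '\0'; path++) {
		hash *= MODEL_FNV_32_PRIME;
		hash ^= (unsigned char)*path;
	}
	return (hash);
}

/* Map a pathname to a hash table bucket. */
static uint32_t
model_shm_bucket(const char *path)
{
	return (model_fnv1_32_str(path) & (MODEL_SHM_BUCKETS - 1));
}
```

Hashing the full pathname keeps lookups cheap without requiring the shared memory names to live in a real file system.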


175157 08-Jan-2008 csjp

When MAC is enabled in the kernel, fix a panic triggered by a locking
assertion hit in swapoff_one() when we un-mount a swap partition. We
should be using curthread where we used thread0 before. This change
also replaces the thread argument with a credential argument, as the
MAC framework only requires the cred.

It should be noted that this allows the machine to be rebooted without
panicking with "cannot differ from curthread or NULL" when MAC is enabled.

Submitted by: rwatson
Reviewed by: attilio
MFC after: 2 weeks


175079 04-Jan-2008 kib

In the vm_map_stack(), check for the specified stack region wraparound.

Reported and tested by: Peter Holm
Reviewed by: alc
MFC after: 3 days
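
A wraparound check of the kind r175079 describes can be written with unsigned arithmetic. This is a sketch under assumptions: the function name, the parameter names, and the explicit map bounds are illustrative and are not taken from vm_map_stack() itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a stack region that starts at addrbos and may grow to
 * max_ssize bytes inside [map_min, map_max].
 */
static bool
model_stack_range_ok(uint64_t addrbos, uint64_t max_ssize,
    uint64_t map_min, uint64_t map_max)
{
	if (addrbos < map_min || addrbos > map_max)
		return (false);
	/* Unsigned overflow: the sum wrapped past the top of the space. */
	if (addrbos + max_ssize < addrbos)
		return (false);
	return (addrbos + max_ssize <= map_max);
}
```

Without the middle test, a region near the top of the address space could wrap to a small end address and pass a naive upper-bound comparison.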


175067 03-Jan-2008 alc

Add an access type parameter to pmap_enter(). It will be used to implement
superpage promotion.

Correct a style error in kmem_malloc(): pmap_enter()'s last parameter is
a Boolean.


175055 02-Jan-2008 alc

Defer setting either PG_CACHED or PG_FREE until after the free page
queues lock is acquired. Otherwise, the state of a reservation's
pages' flags and its population count can be inconsistent. That could
result in a page being freed twice.

Reported by: kris


175041 01-Jan-2008 alc

Correct a style error that was introduced in revision 1.77.


174982 29-Dec-2007 alc

Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)


174940 27-Dec-2007 alc

Add a list of reservations to the vm object structure.

Recycle the vm object's "pg_color" field to represent the color of the
first virtual page address at which the object is mapped instead of the
color of the object's first physical page. Since an object may not be
mapped, introduce a flag "OBJ_COLORED" that indicates whether "pg_color"
is valid.


174939 27-Dec-2007 alc

Add the superpage reservation type.


174825 21-Dec-2007 alc

Update the comment describing vm_phys_unfree_page().


174821 20-Dec-2007 alc

Modify vm_phys_unfree_page() so that it no longer requires the given
page to be in the free lists. Instead, it now returns TRUE if it
removed the page from the free lists and FALSE if the page was not
in the free lists.

This change is required to support superpage reservations. Specifically,
once reservations are introduced, a cached page can either be in the
free lists or a reservation.


174799 19-Dec-2007 alc

Correct one half of a loop continuation condition in vm_phys_unfree_page().
At present, this error is inconsequential; the other half of the loop
continuation condition is sufficient to achieve correct execution.


174769 19-Dec-2007 alc

Eliminate redundant code from vm_page_startup().


174543 11-Dec-2007 alc

Simplify vm_page_free_toq().


174142 02-Dec-2007 alc

Correct a comment.


174137 01-Dec-2007 rwatson

Modify stack(9) stack_print() and stack_sbuf_print() routines to use new
linker interfaces for looking up function names and offsets from
instruction pointers. Create two variants of each call: one that is
"DDB-safe" and avoids locking in the linker, and one that is safe for
use in live kernels, by virtue of observing locking, and in particular
safe when kernel modules are being loaded and unloaded simultaneous to
their use. This will allow them to be used outside of debugging
contexts.

Modify two of three current stack(9) consumers to use the DDB-safe
interfaces, as they run in low-level debugging contexts, such as inside
lockmgr(9) and the kernel memory allocator.

Update man page.


173918 25-Nov-2007 alc

Make contigmalloc(9)'s page laundering more robust. Specifically, use
vm_pageout_fallback_object_lock() in vm_contig_launder_page() to better
handle a lock-ordering problem. Consequently, trylock's failure on the
page's containing object no longer implies that the page cannot be
laundered.

MFC after: 6 weeks


173901 25-Nov-2007 alc

Tidy up: Add comments. Eliminate the pointless
malloc_type_allocated(..., 0) calls that occur when contigmalloc() has
failed. Eliminate the acquisition and release of the page queues lock
from vm_page_release_contig(). Rename contigmalloc2() to
contigmapping(), reflecting what it does.


173853 23-Nov-2007 alc

Add a read/write sysctl for reconfiguring the maximum number of physical
pages that can be wired.

Submitted by: Eugene Grosbein
PR: 114654
MFC after: 6 weeks


173846 22-Nov-2007 alc

Remove an unnecessary call to pmap_remove_all() and the associated "XXX"
comments from vnode_pager_setsize(). This call was introduced in
revision 1.140 to address a problem that no longer exists.
Specifically, pmap_zero_page_area() has replaced a (possibly)
problematic implementation of page zeroing that was based on
vm_pager_map(), bzero(), and vm_pager_unmap().


173836 21-Nov-2007 alc

When reactivating a cached page, reset the page's pool to the default
pool. (Not doing this before was a performance pessimization but not
a cause for panic.)


173708 17-Nov-2007 alc

Prevent the leakage of wired pages in the following circumstances:
First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated.
Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the
pages beyond the EOF are unmapped and freed. However, when the file is
mlock(2)ed, the pages beyond the EOF are unmapped but not freed because
they have a non-zero wire count. This can be a mistake. Specifically,
it is a mistake if the sole reason why the pages are wired is because of
wired, managed mappings. Previously, unmapping the pages destroyed these
wired, managed mappings, but did not reduce the pages' wire count.
Consequently, when the file is unmapped, the pages are not unwired
because the wired mapping has been destroyed. Moreover, when the vm
object is finally destroyed, the pages are leaked because they are still
wired. The fix is to reduce the pages' wired count by the number of
wired, managed mappings destroyed. To do this, I introduce a new pmap
function pmap_page_wired_mappings() that returns the number of managed
mappings to the given physical page that are wired, and I use this
function in vm_object_page_remove().

Reviewed by: tegge
MFC after: 6 weeks
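
The accounting fix in r173708 can be modeled with two counters per page. The sketch is hypothetical throughout: the structure and the `model_*` functions only imitate the roles of pmap_page_wired_mappings() and vm_object_page_remove(), and a page is treated as freeable exactly when its wire count is zero.

```c
#include <assert.h>
#include <stdbool.h>

struct model_wpage {
	int wire_count;		/* total wirings held on the page */
	int wired_mappings;	/* wired, managed mappings of the page */
};

/* Models pmap_page_wired_mappings(): count wired, managed mappings. */
static int
model_page_wired_mappings(const struct model_wpage *m)
{
	return (m->wired_mappings);
}

/*
 * Destroy all mappings of the page.  With fix set, also drop one
 * wiring per wired mapping destroyed, as the revision does.
 */
static void
model_remove_mappings(struct model_wpage *m, bool fix)
{
	int wired = model_page_wired_mappings(m);

	m->wired_mappings = 0;
	if (fix) {
		assert(m->wire_count >= wired);
		m->wire_count -= wired;
	}
}

static bool
model_page_can_free(const struct model_wpage *m)
{
	return (m->wire_count == 0);
}
```

A page that is also wired for some other reason keeps a non-zero count after the adjustment, so it is still protected from being freed.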


173429 07-Nov-2007 pjd

Change unused 'user_wait' argument to 'timo' argument, which will be
used to specify timeout for msleep(9).

Discussed with: alc
Reviewed by: alc


173361 05-Nov-2007 kib

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when kmem_alloc_nofault() failed to allocate address space. Both
functions now return an error instead of panicking or dereferencing NULL.

As a consequence, vmspace_exec() and vmspace_unshare() return the errno
int. A struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


173357 05-Nov-2007 kib

The intent of freeing the (zeroed) page in vm_page_cache() for the
default object, rather than caching it, was to have
vm_pager_has_page(object, pindex, ...) == FALSE imply that there is
no cached page in object at pindex. This avoids the need for explicit
checks for cached pages in vm_object_backing_scan().

For now, we need the same bandaid for the swap object, otherwise both
the vm_page_lookup() and the pager can report that there is no page at
offset, while page is stored in the cache. Also, this fixes another
instance of the KASSERT("object type is incompatible") failure in the
vm_page_cache_transfer().

Reported and tested by: Peter Holm
Reviewed by: alc
MFC after: 3 days


173292 02-Nov-2007 maxim

o Fix panic message: it's swap_pager_putpages() not swap_pager_getpages().

Submitted by: Mark Tinguely


173180 30-Oct-2007 remko

Correct a copy and paste'o in phys_pager.c, we are talking about phys here
and not about devices.

PR: 93755
Approved by: imp (mentor, implicit when re-assigning the ticket to me).


173049 27-Oct-2007 alc

Change vm_page_cache_transfer() such that it does not transfer pages
that would have an offset beyond the end of the target object. Such
pages should remain in the source object.

MFC after: 3 days
Diagnosed and reviewed by: Kostik Belousov
Reported and tested by: Peter Holm


172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


172875 22-Oct-2007 alc

Correct an error of omission in the reimplementation of the page
cache: vnode_pager_setsize() must handle the case where a file is
truncated to a non-page-size-aligned boundary and there is a cached
page underlying the new end of file.

Reported by: kris, tegge
Tested by: kris
MFC after: 3 days


172863 22-Oct-2007 alc

Correct an error in vm_map_sync(), nee vm_map_clean(), that has existed
since revision 1.1. Specifically, neither traversal of the vm map checks
whether the end of the vm map has been reached. Consequently, the first
traversal can wrap around and bogusly return an error.

This error has gone unnoticed for so long because no one had ever before
tried msync(2)ing a region above the stack.

Reported by: peter
MFC after: 1 week


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172780 18-Oct-2007 alc

The previous revision, updating vm_object_page_remove() for the new page
cache, did not account for the case where the vm object has nothing but
cached pages.

Reported by: kris, tegge
Reviewed by: tegge
MFC after: 3 days


172779 18-Oct-2007 peter

Fix cosmetic bug in stale copy of msync_args. 'len' is size_t, not int.


172700 16-Oct-2007 ru

Fix CTL_VM_NAMES.


172545 11-Oct-2007 jhb

Allow recursion on the 'zones' internal UMA zone.

Submitted by: thompsa
MFC after: 1 week
Approved by: re (kensmith)
Discussed with: jeff


172475 08-Oct-2007 kib

Do not dereference NULL pointer.

Reported by: Peter Holm
Reviewed by: alc
Approved by: re (kensmith)


172472 08-Oct-2007 alc

In the rare case that vm_page_cache() actually frees the given page,
it must first ensure that the page is no longer mapped. This is
trivially accomplished by calling pmap_remove_all() a little earlier
in vm_page_cache(). While I'm in the neighborhood, make a related
panic message a little more useful.

Approved by: re (kensmith)
Reported by: Peter Holm and Konstantin Belousov
Reviewed by: Konstantin Belousov


172466 07-Oct-2007 alc

Correct a lock assertion failure in sparc64's pmap_page_is_mapped() that is
a consequence of sparc64/sparc64/vm_machdep.c revision 1.76. It occurs
when uma_small_free() frees a page. The solution has two parts: (1) Mark
pages allocated with VM_ALLOC_NOOBJ as PG_UNMANAGED. (2) Defer the lock
assertion in pmap_page_is_mapped() until after PG_UNMANAGED is tested.
This is safe because both PG_UNMANAGED and PG_FICTITIOUS are immutable
flags, i.e., they do not change state between the time that a page is
allocated and freed.

Approved by: re (kensmith)
PR: 116794


172341 27-Sep-2007 alc

Correct an error of omission in the reimplementation of the page
cache: vm_object_page_remove() should convert any cached pages that
fall within the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.

Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.

Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)


172322 25-Sep-2007 alc

Correct an error in the previous revision, specifically,
vm_object_madvise() should request that the reactivated, cached page
not be busied.

Reported by: Rink Springer
Approved by: re (kensmith)


172317 25-Sep-2007 alc

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


172268 21-Sep-2007 jeff

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re
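
The "subtract ticks and divide by hz" computation can be sketched as follows. The helper name ticks_to_seconds() and the use of unsigned arithmetic are assumptions made for illustration. The kernel's own types and call sites may differ:

```c
#include <assert.h>

/*
 * Hypothetical helper: seconds elapsed between two tick stamps.
 * Unsigned subtraction gives the right difference across a single
 * wraparound of the tick counter.
 */
static unsigned int
ticks_to_seconds(unsigned int now, unsigned int then, unsigned int hz)
{
	return ((now - then) / hz);
}
```

Because the counter advances hz times per second, it wraps hz times sooner than a per-second counter of the same width, which is the trade-off the commit message mentions.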


172207 17-Sep-2007 jeff

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


172188 15-Sep-2007 alc

Correct an assertion in vm_pageout_flush(). Specifically, if a page's
status after vm_pager_put_pages() is VM_PAGER_PEND, then it could have
already been recycled, i.e., freed and reallocated to a new purpose;
thus, asserting that such pages cannot be written is inappropriate.

Reported by: kris
Submitted by: tegge
Approved by: re (kensmith)
MFC after: 1 week


171902 20-Aug-2007 kib

Do not drop vm_map lock between doing vm_map_remove() and vm_map_insert().
For this, introduce vm_map_fixed() that does that for MAP_FIXED case.

Dropping the lock allowed a parallel thread to occupy the freed space.

Reported by: Tijl Coosemans <tijl ulyssis org>
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


171889 18-Aug-2007 kib

Remove comment that is no longer quite true.

Noted by: alc
Approved by: re (kensmith)


171887 18-Aug-2007 kib

Fix the phys_pager in a way similar to rev. 1.83 of
sys/vm/device_pager.c:

Protect the creation of the phys pager with non-NULL handle with the
phys_pager_mtx. Lookup of phys pager in the pagers list by handle is now
synchronized with its removal from the list, and phys_pager_mtx is put
before vm object lock in lock order. Dispose of the phys_pager_alloc_lock
and tsleep calls, together with acquiring Giant, since phys_pager_mtx
now covers the same block.

Reviewed by: alc
Approved by: re (kensmith)


171779 07-Aug-2007 kib

Protect the creation of the device pager with the dev_pager_mtx. Lookup
of device pager in the pagers list by handle is now synchronized with
its removal from the list, and dev_pager_mtx is put before vm object
lock in lock order. Dispose of the dev_pager_sx lock, since dev_pager_mtx
now covers the same block.

Noted by: kensmith
Reviewed by: alc
Approved by: re (kensmith)


171737 05-Aug-2007 alc

Consider a scenario in which one processor, call it Pt, is performing
vm_object_terminate() on a device-backed object at the same time that
another processor, call it Pa, is performing dev_pager_alloc() on the
same device. The problem is that vm_pager_object_lookup() should not be
allowed to return a doomed object, i.e., an object with OBJ_DEAD set,
but it does. In detail, the unfortunate sequence of events is: Pt in
vm_object_terminate() holds the doomed object's lock and sets OBJ_DEAD
on the object. Pa in dev_pager_alloc() holds dev_pager_sx and calls
vm_pager_object_lookup(), which returns the doomed object. Next, Pa
calls vm_object_reference(), which requires the doomed object's lock, so
Pa waits for Pt to release the doomed object's lock. Pt proceeds to the
point in vm_object_terminate() where it releases the doomed object's
lock. Pa is now able to complete vm_object_reference() because it can
now complete the acquisition of the doomed object's lock. So, now the
doomed object has a reference count of one! Pa releases dev_pager_sx
and returns the doomed object from dev_pager_alloc(). Pt now acquires
dev_pager_mtx, removes the doomed object from dev_pager_object_list,
releases dev_pager_mtx, and finally calls uma_zfree with the doomed
object. However, the doomed object is still in use by Pa.

Repeating my key point, vm_pager_object_lookup() must not return a
doomed object. Moreover, the test for the object's state, i.e.,
doomed or not, and the increment of the object's reference count
should be carried out atomically.

Reviewed by: kib
Approved by: re (kensmith)
MFC after: 3 weeks
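
The key point above, that the test for a doomed object and the reference count increment must happen atomically, can be sketched in userland C. This is only an illustration: the single pthread mutex stands in for whatever locking the kernel uses, and the names lookup_and_reference() and doom() are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define OBJ_DEAD 0x1	/* illustrative flag value */

struct object {
	struct object *next;
	void *handle;
	int flags;
	int ref_count;
};

static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct object *object_list;

/* Find a live object by handle and reference it in one critical section. */
static struct object *
lookup_and_reference(void *handle)
{
	struct object *obj;

	pthread_mutex_lock(&list_mtx);
	for (obj = object_list; obj != NULL; obj = obj->next) {
		if (obj->handle == handle && (obj->flags & OBJ_DEAD) == 0) {
			obj->ref_count++;	/* no window between test and increment */
			break;
		}
	}
	pthread_mutex_unlock(&list_mtx);
	return (obj);	/* NULL if absent or doomed */
}

/* Mark an object doomed under the same lock, so lookups cannot race. */
static void
doom(struct object *obj)
{
	pthread_mutex_lock(&list_mtx);
	obj->flags |= OBJ_DEAD;
	pthread_mutex_unlock(&list_mtx);
}
```

Once doom() has run, a lookup can no longer hand out a new reference, so the terminating thread can free the object safely.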


171725 05-Aug-2007 kib

Do not acquire Giant unconditionally around the calls to the cdevsw
d_mmap methods. prep_cdevsw() already installs the shims that
acquire/drop Giant for the methods of a driver that specified the
D_NEEDGIANT flag.

Reviewed by: alc
Approved by: re (kensmith)


171633 27-Jul-2007 alc

Add a counter for the total number of pages cached and support for
reporting the value of this counter in the program "vmstat".

Approved by: re (rwatson)


171599 26-Jul-2007 pjd

When we do open, we should lock the vnode exclusively. This fixes a few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with: kib, ups
Approved by: re (rwatson)


171514 20-Jul-2007 alc

Two changes to vm_fault_additional_pages():

1. Rewrite the backward scan. Specifically, reverse the order in which
pages are allocated so that upon failure it is never necessary to
free pages that were just allocated. Moreover, any allocated pages
can be put to use. This makes the backward scan behave just like the
forward scan.

2. Eliminate an explicit, unsynchronized check for low memory before
calling vm_page_alloc(). It serves no useful purpose. It is, in
effect, optimizing the uncommon case at the expense of the common
case.

Approved by: re (hrs)
MFC after: 3 weeks


171451 14-Jul-2007 alc

Eliminate two unused functions: vm_phys_alloc_pages() and
vm_phys_free_pages(). Rename vm_phys_alloc_pages_locked() to
vm_phys_alloc_pages() and vm_phys_free_pages_locked() to
vm_phys_free_pages(). Add comments regarding the need for the free page
queues lock to be held by callers to these functions. No functional
changes.

Approved by: re (hrs)


171445 14-Jul-2007 alc

Eliminate dead code, specifically, an unused sysctl: "vm.idlezero_maxrun".

Approved by: re (hrs)


171420 13-Jul-2007 alc

Update a comment describing the page queues.

Approved by: re (hrs)


171417 12-Jul-2007 alc

Eliminate dead code.

Approved by: re (hrs)


171347 10-Jul-2007 alc

Correct a problem in the ZERO_COPY_SOCKETS option, specifically, in
vm_page_cowfault(). Initially, if vm_page_cowfault() sleeps, the given
page is wired, preventing it from being recycled. However, when
transmission of the page completes, the page is unwired and returned to
the page queues. At that point, the page is not in any special state
that prevents it from being recycled. Consequently, vm_page_cowfault()
should verify that the page is still held by the same vm object before
retrying the replacement of the page. Note: The containing object is,
however, safe from being recycled by virtue of having a non-zero
paging-in-progress count.

While I'm here, add some assertions and comments.

Approved by: re (rwatson)
MFC After: 3 weeks


171310 08-Jul-2007 alc

Eliminate the special case handling of OBJT_DEVICE objects in
vm_fault_additional_pages() that was introduced in revision 1.47. Then
as now, it is unnecessary because dev_pager_haspage() returns zero for
both the number of pages to read ahead and read behind, producing the
same exact behavior by vm_fault_additional_pages() as the special case
handling.

Approved by: re (rwatson)


171288 06-Jul-2007 alc

When a cached page is reactivated in vm_fault(), update the counter that
tracks the total number of reactivated pages. (We have not been
counting reactivations by vm_fault() since revision 1.46.)

Correct a comment in vm_fault_additional_pages().

Approved by: re (kensmith)
MFC after: 1 week


171212 04-Jul-2007 peter

Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate

Approved by: re (kensmith)


171150 02-Jul-2007 alc

In the previous revision, when I replaced the unconditional acquisition
of Giant in vm_pageout_scan() with VFS_LOCK_GIANT(), I had to eliminate
the acquisition of the vnode interlock before releasing the vm object's
lock because the vnode interlock cannot be held when VFS_LOCK_GIANT() is
performed. Unfortunately, this allows the vnode to be recycled between
the release of the vm object's lock and the vget() on the vnode.

In this revision, I prevent the vnode from being recycled by acquiring
another reference to the vm object and underlying vnode before releasing
the vm object's lock.

This change also addresses another preexisting but trivial problem. By
acquiring another reference to the vm object, I also prevent the vm
object from being recycled. Previously, the "vnodes skipped" counter
could be wrong if it examined a recycled vm object.

Reported by: kib
Reviewed by: kib
Approved by: re (kensmith)
MFC after: 3 weeks


171048 26-Jun-2007 alc

Eliminate the use of Giant from vm_daemon(). Replace the unconditional
use of Giant in vm_pageout_scan() with VFS_LOCK_GIANT().

Approved by: re (kensmith)
MFC after: 3 weeks


171019 24-Jun-2007 alc

Eliminate GIANT_REQUIRED from swap_pager_putpages().

Approved by: re (mux)
MFC after: 1 week


170905 18-Jun-2007 alc

Eliminate unnecessary checks from vm_pageout_clean(): The page that is
passed to vm_pageout_clean() cannot possibly be PG_UNMANAGED because
it came from the inactive queue and PG_UNMANAGED pages are not in any
page queue. Moreover, PG_UNMANAGED pages only exist in OBJT_PHYS
objects, and all pages within a OBJT_PHYS object are PG_UNMANAGED.
So, if the page that is passed to vm_pageout_clean() is not
PG_UNMANAGED, then it cannot be from an OBJT_PHYS object and its
neighbors from the same object cannot themselves be PG_UNMANAGED.

Reviewed by: tegge


170865 17-Jun-2007 mjacob

Don't declare inline a function which isn't.


170864 17-Jun-2007 mjacob

Make sure object is NULL- there is a possible case where you could
fall through to it being used w/o being set. Put a break in the default
case.


170863 17-Jun-2007 mjacob

Initialize reqpage to zero.


170836 16-Jun-2007 alc

If attempting to cache a "busy" page, panic instead of printing a diagnostic
message and returning.


170818 16-Jun-2007 alc

Update a comment.


170816 16-Jun-2007 alc

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re
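
For readers unfamiliar with binary buddy systems, the core address arithmetic can be sketched as below. This is the generic textbook technique, not the allocator's actual code. The 4 KB page size and the function names are assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumes 4 KB pages */

/* Physical address of the buddy of the 2^order-page block starting at pa. */
static uint64_t
buddy_of(uint64_t pa, int order)
{
	return (pa ^ ((uint64_t)1 << (PAGE_SHIFT + order)));
}

/* Start of the order + 1 block formed when a block and its buddy merge. */
static uint64_t
merged_block(uint64_t pa, int order)
{
	return (pa & ~(((uint64_t)1 << (PAGE_SHIFT + order + 1)) - 1));
}
```

Repeated merging of free buddies is what yields the large, naturally aligned runs of contiguous physical memory that superpages and contigmalloc(9) need.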


170658 13-Jun-2007 alc

Eliminate dead code: We have not performed pageouts on the kernel object
in this millennium.


170529 11-Jun-2007 alc

Conditionally acquire Giant in vm_contig_launder_page().


170517 10-Jun-2007 attilio

Optimize vmmeter locking.
In particular:
- Add an explanatory table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some useless comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


170477 10-Jun-2007 alc

Add a new physical memory allocator. However, do not yet connect it
to the build.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re


170307 05-Jun-2007 jeff

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
synchronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


170292 04-Jun-2007 attilio

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distributed loads method for vmmeter (distributed through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


170291 04-Jun-2007 attilio

Rework the PCPU_* (MD) interface:
- Rename PCPU_LAZY_INC into PCPU_INC
- Add the PCPU_ADD interface which just does an add on the pcpu member
given a specific value.

Note that for most architectures PCPU_INC and PCPU_ADD are not safe.
This is a point that needs some discussions/work in the next days.

Reviewed by: alc, bde
Approved by: jeff (mentor)


170174 01-Jun-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. Reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


170152 31-May-2007 kib

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


170149 31-May-2007 attilio

Add functions sx_xlock_sig() and sx_slock_sig().
These functions are intended to perform the same actions as sx_xlock() and
sx_slock(), except that they perform an interruptible sleep, so
that the sleep can be interrupted by external events.
In order to support these new features, some code restructuring is needed,
but the external API won't be affected at all.

Note: use "void" cast for "int" returning functions in order to keep tools
like Coverity Prevent from whining.

Requested by: rwatson
Tested by: rwatson
Reviewed by: jhb
Approved by: jeff (mentor)


169849 22-May-2007 alc

Eliminate the reactivation of cached pages in vm_fault_prefault() and
vm_map_pmap_enter() unless the caller is madvise(MADV_WILLNEED). With
the exception of calls to vm_map_pmap_enter() from
madvise(MADV_WILLNEED), vm_fault_prefault() and vm_map_pmap_enter()
are both used to create speculative mappings. Thus, always
reactivating cached pages is a mistake. In principle, cached pages
should only be reactivated by an actual access. Otherwise, the
following misbehavior can occur. On a hard fault for a text page the
clustering algorithm fetches not only the required page but also
several of the adjacent pages. Now, suppose that one or more of the
adjacent pages are never accessed. Ultimately, these unused pages
become cached pages through the efforts of the page daemon. However,
the next activation of the executable reactivates and maps these
unused pages. Consequently, they are never replaced. In effect, they
become pinned in memory.


169805 20-May-2007 jeff

- rename VMCNT_DEC to VMCNT_SUB to reflect the count argument.

Suggested by: julian@
Contributed by: attilio@


169667 18-May-2007 jeff

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


169431 09-May-2007 rwatson

Update stale comment on protecting UMA per-CPU caches: we now use
critical sections rather than mutexes.


169291 05-May-2007 alc

Define every architecture as either VM_PHYSSEG_DENSE or
VM_PHYSSEG_SPARSE depending on whether the physical address space is
densely or sparsely populated with memory. The effect of this
definition is to determine which of two implementations of
vm_page_array and PHYS_TO_VM_PAGE() is used. The legacy
implementation is obtained by defining VM_PHYSSEG_DENSE, and a new
implementation that trades off time for space is obtained by defining
VM_PHYSSEG_SPARSE. For now, all architectures except for ia64 and
sparc64 define VM_PHYSSEG_DENSE. Defining VM_PHYSSEG_SPARSE on ia64
allows the entirety of my Itanium 2's memory to be used. Previously,
only the first 1 GB could be used. Defining VM_PHYSSEG_SPARSE on
sparc64 allows USIIIi-based systems to boot without crashing.

This change is a combination of Nathan Whitehorn's patch and my own
work in perforce.

Discussed with: kmacy, marius, Nathan Whitehorn
PR: 112194
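
The time-for-space trade-off between the two schemes can be sketched as follows. This is a simplified illustration only: the struct layouts and function names are hypothetical, not the kernel's definitions of vm_page_array or PHYS_TO_VM_PAGE():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumes 4 KB pages */

struct page { uint64_t phys_addr; };

struct seg {
	uint64_t start, end;		/* physical range [start, end) */
	struct page *first_page;	/* page structures for this range */
};

/* Dense: one array spans the whole address range, holes included. */
static struct page *
phys_to_page_dense(struct page *array, uint64_t first_pa, uint64_t pa)
{
	return (&array[(pa - first_pa) >> PAGE_SHIFT]);
}

/* Sparse: search the segments; no page structures exist for holes. */
static struct page *
phys_to_page_sparse(struct seg *segs, int nsegs, uint64_t pa)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		if (pa >= segs[i].start && pa < segs[i].end)
			return (&segs[i].first_page[
			    (pa - segs[i].start) >> PAGE_SHIFT]);
	}
	return (NULL);
}
```

The dense lookup is a single index computation but needs array slots for every hole. The sparse lookup costs a search but only needs page structures for memory that actually exists.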


169048 26-Apr-2007 alc

Remove some code from vmspace_fork() that became redundant after
revision 1.334 modified _vm_map_init() to initialize the new vm map's
flags to zero.


168979 23-Apr-2007 rwatson

Audit pathnames looked up in swapon(2) and swapoff(2).

MFC after: 2 weeks
Obtained from: TrustedBSD Project


168852 19-Apr-2007 alc

Correct contigmalloc2()'s implementation of M_ZERO. Specifically,
contigmalloc2() was always testing the first physical page for PG_ZERO,
not the current page of interest.

Submitted by: Michael Plass
PR: 81301
MFC after: 1 week
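
The class of bug fixed here, testing the first page on every iteration instead of the current one, can be sketched as below. The struct, flag value, and function name zero_run() are hypothetical stand-ins, not the contigmalloc2() code:

```c
#include <assert.h>
#include <string.h>

#define PG_ZERO   0x1	/* illustrative flag: page is already zeroed */
#define PAGE_SIZE 4096

struct page { int flags; };

/*
 * Zero every page in the run that is not flagged as prezeroed and
 * return how many pages were zeroed.
 */
static int
zero_run(struct page *pages, unsigned char *mem, int npages)
{
	int i, zeroed = 0;

	for (i = 0; i < npages; i++) {
		/* Test pages[i], the current page, not pages[0]. */
		if ((pages[i].flags & PG_ZERO) == 0) {
			memset(mem + (size_t)i * PAGE_SIZE, 0, PAGE_SIZE);
			zeroed++;
		}
	}
	return (zeroed);
}
```

Had the loop tested pages[0] each time, a prezeroed first page would have caused every later, possibly dirty, page to be skipped.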


168851 19-Apr-2007 alc

Correct two comments.

Submitted by: Michael Plass


168581 10-Apr-2007 keramida

Minor typo fix, noticed while I was going through *_pager.c files.


168395 05-Apr-2007 pjd

When KVA is exhausted, try the vm_lowmem event for the last time before
panicking. This helps a lot in ZFS stability.


168394 05-Apr-2007 pjd

Fix a problem for file systems that don't implement VOP_BMAP() operation.

The problem is this: vm_fault_additional_pages() calls vm_pager_has_page(),
which calls vnode_pager_haspage(). Now when VOP_BMAP() returns an error (e.g.,
EOPNOTSUPP), vnode_pager_haspage() returns TRUE without initializing 'before'
and 'after' arguments, so we have some accidental values there. This basically
was causing this condition to be met:

    if ((rahead + rbehind) >
        ((cnt.v_free_count + cnt.v_cache_count) - cnt.v_free_reserved)) {
            pagedaemon_wakeup();
            [...]
    }

(we have some random values in rahead and rbehind variables)

I'm not entirely sure this is the right fix, maybe we should just return FALSE
in vnode_pager_haspage() when VOP_BMAP() fails?

alc@ knows about this problem, maybe he will be able to come up with a better
fix if this is not the right one.
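
One way to picture the hazard and a possible remedy is the sketch below, which gives the out-parameters defined values on the error path. It is an illustration under assumptions: has_page(), bmap_fn, and bmap_unsupported() are hypothetical names, and this is not necessarily the fix that was committed:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a block-mapping query that may be unsupported. */
typedef int (*bmap_fn)(long pindex, int *runp, int *runb);

/* A file system that does not implement the operation. */
static int
bmap_unsupported(long pindex, int *runp, int *runb)
{
	(void)pindex;
	(void)runp;
	(void)runb;
	return (EOPNOTSUPP);
}

/*
 * Report whether a page is available.  On a mapping error the page is
 * still reported as present, but the read-behind and read-ahead counts
 * are set to zero instead of being left uninitialized.
 */
static int
has_page(bmap_fn bmap, long pindex, int *before, int *after)
{
	if (bmap(pindex, after, before) != 0) {
		*before = 0;
		*after = 0;
		return (1);
	}
	return (1);
}
```

With the counts forced to zero, a caller that sums them for a low-memory test no longer operates on stack garbage.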


167939 27-Mar-2007 alc

Prevent a race between vm_object_collapse() and vm_object_split() from
causing a crash.

Suppose that we have two objects, obj and backing_obj, where
backing_obj is obj's backing object. Further, suppose that
backing_obj has a reference count of two. One being the reference
held by obj and the other by a map entry. Now, suppose that the map
entry is deallocated and its reference removed by
vm_object_deallocate(). vm_object_deallocate() recognizes that the
only remaining reference is from a shadow object, obj, and calls
vm_object_collapse() on obj. vm_object_collapse() executes

    if (backing_object->ref_count == 1) {
            /*
             * If there is exactly one reference to the backing
             * object, we can collapse it into the parent.
             */
            vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT);

vm_object_backing_scan(OBSC_COLLAPSE_WAIT) executes

    if (op & OBSC_COLLAPSE_WAIT) {
            vm_object_set_flag(backing_object, OBJ_DEAD);
    }

Finally, suppose that either vm_object_backing_scan() or
vm_object_collapse() sleeps releasing its locks. At this instant,
another thread executes vm_object_split(). It crashes in
vm_object_reference_locked() on the assertion that the object is not
dead. If, however, assertions are not enabled, it crashes much later,
after the object has been recycled, in vm_object_deallocate() because
the shadow count and shadow list are inconsistent.

Reviewed by: tegge
Reported by: jhb
MFC after: 1 week


167880 25-Mar-2007 alc

Two small changes to vm_map_pmap_enter():

1) Eliminate an unnecessary check for fictitious pages. Specifically,
only device-backed objects contain fictitious pages and the object is
not device-backed.

2) Change the types of "psize" and "tmpidx" to vm_pindex_t in order to
prevent possible wrap around with extremely large maps and objects,
respectively. Observed by: tegge (last summer)


167829 23-Mar-2007 alc

vm_page_busy() no longer requires the page queues lock to be held. Reduce
the scope of the page queues lock in vm_fault() accordingly.


167795 22-Mar-2007 alc

Change the order of lock reacquisition in vm_object_split() in order to
simplify the code slightly. Add a comment concerning lock ordering.


167243 05-Mar-2007 alc

Use PCPU_LAZY_INC() to update page fault statistics.


167091 27-Feb-2007 jhb

Use pause() in vm_object_deallocate() to yield the CPU to the lock holder
rather than a tsleep() on &proc0. The only wakeup on &proc0 is intended
to awaken the swapper, not random threads blocked in
vm_object_deallocate().


167086 27-Feb-2007 jhb

Use pause() rather than tsleep() on stack variables and function pointers.


166964 25-Feb-2007 alc

Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to a OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage(). It is no longer used.


166882 22-Feb-2007 alc

Change the page's CLEANCHK flag from being a page queue mutex synchronized
flag to a vm object mutex synchronized flag.


166808 18-Feb-2007 alc

Enable vm_page_free() and vm_page_free_zero() to be called on some pages
without the page queues lock being held, specifically, pages that are not
contained in a vm object and not a member of a page queue.


166805 17-Feb-2007 alc

Remove a stale comment. Add punctuation to a nearby comment.


166736 15-Feb-2007 alc

Relax the page queue lock assertions in vm_page_remove() and
vm_page_free_toq() to account for recent changes that allow
vm_page_free_toq() to be called on some pages without the page queues lock
being held, specifically, pages that are not contained in a vm object and
not a member of a page queue. (Examples of such pages include page table
pages, pv entry pages, and uma small alloc pages.)


166699 14-Feb-2007 alc

Avoid the unnecessary acquisition of the free page queues lock when a page
is actually being added to the hold queue, not the free queue. At the same
time, avoid unnecessary tests to wake up threads waiting for free memory
and the idle thread that zeroes free pages. (These tests will be performed
later when the page finally moves from the hold queue to the free queue.)


166654 11-Feb-2007 rwatson

Add uma_set_align() interface, which will be called at most once during
boot by MD code to indicate the detected alignment preference. Rather than
cache alignment being encoded in UMA consumers by defining a global
alignment value of (16 - 1) in UMA_ALIGN_CACHE, UMA_ALIGN_CACHE is now
a special value (-1) that causes UMA to look at registered alignment. If
no preferred alignment has been selected by MD code, a default alignment
of (16 - 1) will be used.

Currently, no hardware platforms specify alignment; architecture
maintainers will need to modify MD startup code to specify an alignment
if desired. This must occur before initialization of UMA so that all UMA
zones pick up the requested alignment.

Reviewed by: jeff, alc
Submitted by: attilio
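
The sentinel scheme described above can be sketched in a few lines. The names ALIGN_CACHE, set_align(), and resolve_align() are simplified stand-ins chosen for illustration, not the UMA implementation:

```c
#include <assert.h>

#define ALIGN_CACHE (-1)	/* special value: use the registered alignment */

/* Alignment mask used for ALIGN_CACHE. Defaults to (16 - 1). */
static int registered_align = 16 - 1;

/* Called at most once by machine-dependent startup code. */
static void
set_align(int mask)
{
	registered_align = mask;
}

/* Translate a consumer's request into a concrete alignment mask. */
static int
resolve_align(int requested)
{
	return (requested == ALIGN_CACHE ? registered_align : requested);
}
```

Because consumers pass the sentinel rather than a hard-coded mask, the platform's preference takes effect without touching any consumer, provided it is registered before zones are created.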


166637 11-Feb-2007 alc

Use the free page queue mutex instead of the page queue mutex to
synchronize sleeping and waking of the zero idle thread.


166550 07-Feb-2007 jhb

- Move 'struct swdevt' back into swap_pager.h and expose it to userland.
- Restore support for fetching swap information from crash dumps via
kvm_get_swapinfo(3) to fix pstat -T/-s on crash dumps.

Reviewed by: arch@, phk
MFC after: 1 week


166544 07-Feb-2007 alc

Change the pagedaemon, vm_wait(), and vm_waitpfault() to sleep on the
vm page queue free mutex instead of the vm page queue mutex.


166508 05-Feb-2007 alc

Change the free page queue lock from a spin mutex to a default (blocking)
mutex. With the demise of Alpha support, there is no longer a reason for
it to be a spin mutex.


166213 25-Jan-2007 mohans

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


166211 24-Jan-2007 mohans

Fix for a bug where only one process (of multiple) blocked on
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wakeup all such blocked
processes instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.

Reviewed by: ups@


166188 23-Jan-2007 jeff

- Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.

Discussed with: julian

- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.

Suggested by: jhb

Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.


166074 17-Jan-2007 delphij

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


165928 10-Jan-2007 rwatson

Remove uma_zalloc_arg() hack, which coerced M_WAITOK to M_NOWAIT when
allocations were made using improper flags in interrupt context.
Replace with a simple WITNESS warning call. This restores the
invariant that M_WAITOK allocations will always succeed or die
horribly trying, which is relied on by many UMA consumers.

MFC after: 3 weeks
Discussed with: jhb


165854 07-Jan-2007 alc

Declare the map entry created by kmem_init() for the range from
VM_MIN_KERNEL_ADDRESS to the end of the kernel's bootstrap data as
MAP_NOFAULT.


165809 05-Jan-2007 jhb

- Add a new function uma_zone_exhausted() to see if a zone is full.
- Add a printf in swp_pager_meta_build() to warn if the swapzone becomes
exhausted so that there's at least a warning before a box that runs out
of swapzone space before running out of swap space deadlocks.

MFC after: 1 week
Reviewed by: alc


165309 17-Dec-2006 alc

Optimize vm_object_split(). Specifically, make the number of iterations
equal to the number of physical pages that are renamed to the new object
rather than the new object's virtual size.


165278 16-Dec-2006 alc

Simplify the computation of the new object's size in vm_object_split().


165007 08-Dec-2006 kmacy

Remove the requirement that phys_avail be sorted in ascending order
by explicitly finding the lowest and highest addresses when calculating
the size of the vm_pages array

Reviewed by: alc
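As a rough illustration of the idea in this change (not the committed code), the span of an unsorted list of physical ranges can be found in one pass by tracking the minimum start and maximum end. The function name and the termination convention (a pair whose end address is 0) are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scan an unsorted list of (start, end) pairs and report the lowest start
 * and the highest end.  The list is assumed to end with a pair whose end
 * address is 0.  For an empty list, *low is UINT64_MAX and *high is 0.
 * sketch_phys_span() is a hypothetical name, not a kernel function.
 */
static void
sketch_phys_span(const uint64_t *avail, uint64_t *low, uint64_t *high)
{
	size_t i;

	*low = UINT64_MAX;
	*high = 0;
	for (i = 0; avail[i + 1] != 0; i += 2) {
		if (avail[i] < *low)
			*low = avail[i];
		if (avail[i + 1] > *high)
			*high = avail[i + 1];
	}
}
```

Because no ordering is assumed, the result is the same however the ranges are listed.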


164936 06-Dec-2006 julian

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


164446 20-Nov-2006 ru

The clean_map has been made local to vm_init.c long ago.


164437 20-Nov-2006 ru

Remove a redundant pointer-type variable.


164429 20-Nov-2006 ru

When counting vm totals, skip unreferenced objects, including
vnodes representing mounted file systems.

Reviewed by: alc
MFC after: 3 days


164234 13-Nov-2006 alc

There is no point in setting PG_REFERENCED on kmem_object pages because
they are "unmanaged", i.e., non-pageable, pages.

Remove a stale comment.


164229 12-Nov-2006 alc

Make pmap_enter() responsible for setting PG_WRITEABLE instead
of its caller. (As a beneficial side-effect, a high-contention
acquisition of the page queues lock in vm_fault() is eliminated.)


164101 08-Nov-2006 alc

I misplaced the assertion that was added to vm_page_startup() in the
previous change. Correct its placement.


164100 08-Nov-2006 alc

Simplify the construction of the free queues in vm_page_startup(). Add
an assertion to test a hypothesis concerning other redundant computation
in vm_page_startup().


164089 08-Nov-2006 alc

Ensure that the page's oflags field is initialized by contigmalloc().


164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


163709 26-Oct-2006 jb

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


163702 26-Oct-2006 rwatson

Better align output of "show uma" by moving from displaying the basic
counters of allocs/frees/use for each zone to the same statistics
shown by userspace "vmstat -z".

MFC after: 3 days


163622 23-Oct-2006 alc

The page queues lock is no longer required by vm_page_wakeup().


163614 22-Oct-2006 alc

The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup(). Reduce or eliminate its use accordingly.


163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


163604 22-Oct-2006 alc

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


163594 21-Oct-2006 alc

Eliminate unnecessary PG_BUSY tests. They originally served a purpose
that is now handled by vm object locking.


163361 14-Oct-2006 alc

Long ago, revision 1.22 of vm/vm_pager.h introduced a bug. Specifically,
it introduced a check after the call to file system's get pages method
that assumes that the get pages method does not change the array of pages
that is passed to it. In the case of vnode_pager_generic_getpages(),
this assumption has been incorrect. The contents of the array of pages
may be shifted by vnode_pager_generic_getpages(). Likely, the problem
has been hidden by vnode_pager_haspage() limiting the set of pages that
are passed to vnode_pager_generic_getpages() such that a shift never
occurs.

The fix implemented herein is to adjust the pointer to the array of pages
rather than shifting the pages within the array.

MFC after: 3 weeks
Fix suggested by: tegge
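The difference between the two approaches can be sketched in user-space C. This is only an illustration with plain integers standing in for pages; both function names are hypothetical.

```c
#include <string.h>

/*
 * Shifting variant: moves the wanted window to the front of the array.
 * Any caller that still expects the original layout is now wrong.
 */
static void
sk_window_by_shift(int *pages, int first, int count)
{
	memmove(pages, pages + first, (size_t)count * sizeof(pages[0]));
}

/*
 * Pointer variant: leaves the array untouched and returns the address
 * at which the wanted window starts.
 */
static int *
sk_window_by_pointer(int *pages, int first)
{
	return (pages + first);
}
```

With the pointer variant, a check made by the caller after the call still sees the array it passed in.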


163359 14-Oct-2006 alc

Change vnode_pager_addr() such that on returning it distinguishes between
an error returned by VOP_BMAP() and a hole in the file.

Change the callers to vnode_pager_addr() such that they return
VM_PAGER_ERROR when VOP_BMAP fails instead of a zero-filled page.

Reviewed by: tegge
MFC after: 3 weeks
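A minimal sketch of the distinction, assuming a toy block map in which -1 marks a hole. All names here (sk_block_addr, SK_HOLE, the SK_PAGER_* values) are hypothetical stand-ins, not the kernel's interfaces.

```c
#define	SK_HOLE	(-1L)		/* assumed marker for an unbacked block */

enum sk_pager_result { SK_PAGER_OK, SK_PAGER_ZERO_FILL, SK_PAGER_ERROR };

/*
 * Translate a logical block.  Returns nonzero if the translation itself
 * failed; returns 0 and stores SK_HOLE in *addr if the block is a hole.
 */
static int
sk_block_addr(const long *map, int nblocks, int blk, long *addr)
{
	if (blk < 0 || blk >= nblocks)
		return (1);
	*addr = map[blk];
	return (0);
}

/* The caller reports an error for a failed lookup, zero fill for a hole. */
static enum sk_pager_result
sk_get_page(const long *map, int nblocks, int blk)
{
	long addr;

	if (sk_block_addr(map, nblocks, blk, &addr) != 0)
		return (SK_PAGER_ERROR);
	if (addr == SK_HOLE)
		return (SK_PAGER_ZERO_FILL);
	return (SK_PAGER_OK);
}
```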


163259 12-Oct-2006 kmacy

sun4v requires TSBs (translation storage buffers) to be contiguous and be
size aligned requiring heavy usage of vm_page_alloc_contig

This change makes vm_page_alloc_contig SMP safe

Approved by: scottl (acting as backup for mentor rwatson)


163210 10-Oct-2006 alc

Distinguish between two distinct kinds of errors from VOP_BMAP() in
vnode_pager_generic_getpages(): (1) that VOP_BMAP() is unsupported by the
underlying file system and (2) an error in performing the VOP_BMAP().
Previously, vnode_pager_generic_getpages() assumed that all errors were
of the first type. If, in fact, the error was of the second type, the
likely outcome was for the process to become permanently blocked on a busy
page.

MFC after: 3 weeks
Reviewed by: tegge


163140 08-Oct-2006 alc

Change vnode_pager_generic_getpages() so that it does not panic if the
given file is sparse. Instead, it zeroes the requested page.

Reviewed by: tegge
PR: kern/98116
MFC after: 3 days


162750 29-Sep-2006 kensmith

Fix two minor style(9) nits in v1.313 which were noticed during an
MFC review. alc@ will be MFCing v1.313 plus style fix to RELENG_6.


161968 03-Sep-2006 alc

Make vm_page_release_contig() static.


161674 27-Aug-2006 alc

Refactor vm_page_sleep_if_busy() so that the test for a busy page is
inlined and a procedure call is made in the rare case, i.e., when it is
necessary to sleep. In this case, inlining the test actually makes the
kernel smaller.


161629 26-Aug-2006 alc

Prevent a call to contigmalloc() that asks for more physical memory than
the machine has from causing a panic.

Submitted by: Michael Plass
PR: 101668
MFC after: 3 days


161597 25-Aug-2006 alc

The return value from vm_pageq_add_new_page() is not used. Eliminate it.


161492 21-Aug-2006 alc

Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and
eliminate their declarations from various source files.


161489 21-Aug-2006 alc

vm_page_zero_idle()'s return value serves no purpose. Eliminate it.


161486 21-Aug-2006 alc

Page flags are reset on (re)allocation. There is no need to clear any
flags except for PG_ZERO in vm_page_free_toq().


161257 13-Aug-2006 alc

Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock. (Reviewed by: tegge)
Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().


161213 11-Aug-2006 alc

Ensure that the page's new field for object-synchronized flags is always
initialized to zero.

Call vm_page_sleep_if_busy() instead of duplicating its implementation in
vm_page_grab().


161143 10-Aug-2006 alc

Change vm_page_cowfault() so that it doesn't allocate a pre-busied page.


161125 09-Aug-2006 alc

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


161014 06-Aug-2006 alc

Eliminate the acquisition and release of the page queues lock around a call
to vm_page_sleep_if_busy().


161013 06-Aug-2006 alc

Change vm_page_sleep_if_busy() so that it no longer requires the caller to
hold the page queues lock.


161005 05-Aug-2006 alc

Remove a stale comment.


160960 03-Aug-2006 alc

When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.


160889 01-Aug-2006 alc

Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)


160585 22-Jul-2006 alc

Export the number of object bypasses and collapses through sysctl.


160561 21-Jul-2006 alc

Retire debug.mpsafevm. None of the architectures supported in CVS require
it any longer.


160540 21-Jul-2006 alc

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


160525 20-Jul-2006 alc

Add pmap_clear_write() to the interface between the virtual memory
system's machine-dependent and machine-independent layers. Once
pmap_clear_write() is implemented on all of our supported
architectures, I intend to replace all calls to pmap_page_protect() by
calls to pmap_clear_write(). Why? Both the use and implementation of
pmap_page_protect() in our virtual memory system has subtle errors,
specifically, the management of execute permission is broken on some
architectures. The "prot" argument to pmap_page_protect() should
behave differently from the "prot" argument to other pmap functions.
Instead of meaning, "give the specified access rights to all of the
physical page's mappings," it means "don't take away the specified
access rights from all of the physical page's mappings, but do take
away the ones that aren't specified." However, owing to our i386
legacy, i.e., no support for no-execute rights, all but one invocation
of pmap_page_protect() specifies VM_PROT_READ only, when the intent
is, in fact, to remove only write permission. Consequently, a
faithful implementation of pmap_page_protect(), e.g., ia64, would
remove execute permission as well as write permission. On the other
hand, some architectures that support execute permission have
basically ignored whether or not VM_PROT_EXECUTE is passed to
pmap_page_protect(), e.g., amd64 and sparc64. This change represents
the first step in replacing pmap_page_protect() by the less subtle
pmap_clear_write() that is already implemented on amd64, i386, and
sparc64.

Discussed with: grehan@ and marcel@
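The two semantics described above can be illustrated with plain bit masks. This is a sketch only; the SK_PROT_* values and function names are stand-ins chosen for the example, not the kernel's definitions.

```c
#define	SK_PROT_READ	0x1u
#define	SK_PROT_WRITE	0x2u
#define	SK_PROT_EXECUTE	0x4u

/*
 * "Keep at most these rights": every right not named in 'keep' is taken
 * away, which is how a faithful pmap_page_protect() would behave.
 */
static unsigned
sk_page_protect(unsigned cur, unsigned keep)
{
	return (cur & keep);
}

/* Remove write permission only; every other right is left as it was. */
static unsigned
sk_clear_write(unsigned cur)
{
	return (cur & ~SK_PROT_WRITE);
}
```

Passing only the read bit to the first function also strips execute permission from a read/write/execute mapping, while the second leaves it readable and executable.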


160460 18-Jul-2006 rwatson

Fix build of uma_core.c when DDB is not compiled into the kernel by
making uma_zone_sumstat() ifdef DDB, as it's only used with DDB now.

Submitted by: Wolfram Fenske <Wolfram.Fenske at Student.Uni-Magdeburg.DE>


160421 17-Jul-2006 alc

Ensure that vm_object_deallocate() doesn't dereference a stale object
pointer: When vm_object_deallocate() sleeps because of a non-zero
paging in progress count on either object or object's shadow,
vm_object_deallocate() must ensure that object is still the shadow's
backing object when it reawakens. In fact, object may have been
deallocated while vm_object_deallocate() slept. If so, reacquiring
the lock on object can lead to a deadlock.

Submitted by: ups@
MFC after: 3 weeks


160414 16-Jul-2006 rwatson

Remove sysctl_vm_zone() and vm.zone sysctl from 7.x. As of 6.x,
libmemstat(3) is used by vmstat (and friends) to produce more accurate
and more detailed statistics information in a machine-readable way,
and vmstat continues to provide the same text-based front-end.

This change should not be MFC'd.


160236 10-Jul-2006 alc

Set debug.mpsafevm to true on PowerPC. (Now, by default, all architectures
in CVS have debug.mpsafevm set to true.)

Tested by: grehan@


159880 23-Jun-2006 jhb

Move the code to handle the vm.blacklist tunable up a layer into
vm_page_startup(). As a result, we now only lookup the tunable once
instead of looking it up once for every physical page of memory in the
system. This cuts out about a 1 second or so delay in boot on x86
systems. The delay is much larger and more noticeable on sun4v apparently.

Reported by: kmacy
MFC after: 1 week


159837 21-Jun-2006 kib

Make the mincore(2) return ENOMEM when requested range is not fully mapped.

Requested by: Bruno Haible <bruno at clisp org>
Reviewed by: alc
Approved by: pjd (mentor)
MFC after: 1 month


159681 17-Jun-2006 alc

Use ptoa(psize) instead of size to compute the end of the mapping in
vm_map_pmap_enter().


159627 15-Jun-2006 ups

Remove mpte optimization from pmap_enter_quick().
There is a race with the current locking scheme and removing
it should have no measurable performance impact.
This fixes page faults leading to panics in pmap_enter_quick_locked()
on amd64/i386.

Reviewed by: alc,jhb,peter,ps


159620 14-Jun-2006 alc

Correct an error in the previous revision that could lead to a panic:
Found mapped cache page. Specifically, if cnt.v_free_count dips below
cnt.v_free_reserved after p_start has been set to a non-NULL value,
then vm_map_pmap_enter() would break out of the loop and incorrectly
call pmap_enter_object() for the remaining address range. To correct
this error, this revision truncates the address range so that
pmap_enter_object() will not map any cache pages.

In collaboration with: tegge@
Reported by: kris@


159475 10-Jun-2006 alc

Enable debug.mpsafevm on arm by default.

Tested by: cognet@


159303 05-Jun-2006 alc

Introduce the function pmap_enter_object(). It maps a sequence of resident
pages from the same object. Use it in vm_map_pmap_enter() to reduce the
locking overhead of premapping objects.

Reviewed by: tegge@


159121 31-May-2006 ps

Fix minidumps to include pages allocated via pmap_map on amd64.
These pages are allocated from the direct map, and were not previously
tracked. This included the vm_page_array and the early UMA bootstrap
pages.

Reviewed by: peter


159054 29-May-2006 tegge

Close race between vmspace_exitfree() and exit1() and races between
vmspace_exitfree() and vmspace_free() which could result in the same
vmspace being freed twice.

Factor out part of exit1() into new function vmspace_exit(). Attach
to vmspace0 to allow old vmspace to be freed earlier.

Add new function, vmspace_acquire_ref(), for obtaining a vmspace
reference for a vmspace belonging to another process. Avoid changing
vmspace refcount from 0 to 1 since that could also lead to the same
vmspace being freed twice.

Change vmtotal() and swapout_procs() to use vmspace_acquire_ref().

Reviewed by: alc
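The rule of never raising a reference count from 0 to 1 is commonly expressed as a compare-and-swap loop. The following is a user-space sketch using C11 atomics, not the kernel implementation; the function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still live.  A count of zero
 * means the owner is already tearing the object down, so the count is
 * left alone and the caller is told to back off.
 */
static bool
sk_ref_acquire_if_live(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	do {
		if (old <= 0)
			return (false);
		/* On failure, 'old' is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(refcnt, &old, old + 1));
	return (true);
}
```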


158803 21-May-2006 rwatson

When allocating a bucket to hold a free'd item in UMA fails, don't
report this as an allocation failure for the item type. The failure
will be separately recorded with the bucket type. This may eliminate
high mbuf allocation failure counts under some circumstances, which
can be alarming in appearance, but not actually a problem in
practice.

MFC after: 2 weeks
Reported by: ps, Peter J. Blok <pblok at bsd4all dot org>,
OxY <oxy at field dot hu>,
Gabor MICSKO <gmicskoa at szintezis dot hu>


158525 13-May-2006 alc

Simplify the implementation of vm_fault_additional_pages() based upon the
object's memq being ordered. Specifically, replace repeated calls to
vm_page_lookup() by two simple constant-time operations.

Reviewed by: tegge


158387 10-May-2006 pjd

Use better order here.


158020 25-Apr-2006 alc

Add synchronization to vm_pageq_add_new_page() so that it can be called
safely after kernel initialization. Remove GIANT_REQUIRED.

MFC after: 6 weeks


157920 21-Apr-2006 trhodes

It seems that POSIX would rather ENODEV returned in place of EINVAL when
trying to mmap() an fd that isn't a normal file.

Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Submitted by: fanf


157908 21-Apr-2006 peter

Introduce minidumps. Full physical memory crash dumps are still available
via the debug.minidump sysctl and tunable.

Traditional dumps store all physical memory. This was once a good thing
when machines had a maximum of 64M of ram and 1GB of kvm. These days,
machines often have many gigabytes of ram and a smaller amount of kvm.
libkvm+kgdb don't have a way to access physical ram that is not mapped
into kvm at the time of the crash dump, so the extra ram being dumped
is mostly wasted.

Minidumps invert the process. Instead of dumping physical memory in
order to guarantee that all of kvm's backing is dumped, minidumps
instead dump only memory that is actively mapped into kvm.

amd64 has a direct map region that things like UMA use. Obviously we
cannot dump all of the direct map region because that is effectively
an old style all-physical-memory dump. Instead, introduce a bitmap
and two helper routines (dump_add_page(pa) and dump_drop_page(pa)) that
allow certain critical direct map pages to be included in the dump.
uma_machdep.c's allocator is the intended consumer.

Dumps are a custom format. At the very beginning of the file is a header,
then a copy of the message buffer, then the bitmap of pages present in
the dump, then the final level of the kvm page table trees (2MB mappings
are expanded into 4K page mappings), then the sparse physical pages
according to the bitmap. libkvm can now conveniently access the kvm
page table entries.

Booting my test 8GB machine, forcing it into ddb and forcing a dump
leads to a 48MB minidump. While this is a best case, I expect minidumps
to be in the 100MB-500MB range. They can never be larger than physical
memory.

Minidumps are on by default. Turning them off should only be necessary
when debugging corrupt kernel page table management, as that would mess
up minidumps as well.

Both minidumps and regular dumps are supported on the same machine.
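A page bitmap with add and drop helpers, as described for the direct map pages, might look roughly like the sketch below. It is a user-space toy: the sk_ names, the 4K page size, and the fixed bitmap size are assumptions, and the real dump_add_page() and dump_drop_page() may differ in every detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SK_PAGE_SHIFT	12		/* assumed 4K pages */
#define	SK_NPAGES	4096		/* pages covered by this toy bitmap */

static uint64_t sk_dump_bitmap[SK_NPAGES / 64];

/* Mark the page containing physical address 'pa' for inclusion. */
static void
sk_dump_add_page(uint64_t pa)
{
	uint64_t idx = pa >> SK_PAGE_SHIFT;

	if (idx < SK_NPAGES)
		sk_dump_bitmap[idx / 64] |= (uint64_t)1 << (idx % 64);
}

/* Remove the page containing 'pa' from the set. */
static void
sk_dump_drop_page(uint64_t pa)
{
	uint64_t idx = pa >> SK_PAGE_SHIFT;

	if (idx < SK_NPAGES)
		sk_dump_bitmap[idx / 64] &= ~((uint64_t)1 << (idx % 64));
}

/* Report whether the page containing 'pa' is currently marked. */
static bool
sk_dump_has_page(uint64_t pa)
{
	uint64_t idx = pa >> SK_PAGE_SHIFT;

	return (idx < SK_NPAGES &&
	    (sk_dump_bitmap[idx / 64] & ((uint64_t)1 << (idx % 64))) != 0);
}
```

One bit per page keeps the bookkeeping small: this toy bitmap covers 16MB of physical memory in 512 bytes.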


157815 17-Apr-2006 jhb

Change msleep() and tsleep() to not alter the calling thread's priority
if the specified priority is zero. This avoids a race where the calling
thread could read a snapshot of its current priority, then a different
thread could change the first thread's priority, then the original thread
would call sched_prio() inside msleep() undoing the change made by the
second thread. I used a priority of zero as no thread that calls msleep()
or tsleep() should be specifying a priority of zero anyway.

The various places that passed 'curthread->td_priority' or some variant
as the priority now pass 0.
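The new convention can be modelled in a few lines. This is a toy model with hypothetical names, not the sleep code itself: a priority of zero means "leave the thread's priority alone".

```c
struct sk_thread {
	int	priority;
};

/*
 * Toy model of the sleep entry point: only a nonzero 'pri' overrides the
 * thread's current priority, so a caller that passes 0 cannot undo a
 * priority change made concurrently by another thread.
 */
static void
sk_sleep(struct sk_thread *td, int pri)
{
	if (pri != 0)
		td->priority = pri;
	/* ... the thread would block here ... */
}
```

Callers that used to pass a snapshot of their own priority no longer need to read it at all, which is what removes the race.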


157628 10-Apr-2006 pjd

On shutdown try to turn off all swap devices. This way GEOM providers are
properly closed on shutdown.

Requested by: ru
Reviewed by: alc
MFC after: 2 weeks


157443 03-Apr-2006 peter

Remove the unused sva and eva arguments from pmap_remove_pages().


157144 26-Mar-2006 jkoshy

MFP4: Support for profiling dynamically loaded objects.

Kernel changes:

Inform hwpmc of executable objects brought into the system by
kldload() and mmap(), and of their removal by kldunload() and
munmap(). A helper function linker_hwpmc_list_objects() has been
added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
the list of currently loaded kernel modules.

The unused `MAPPINGCHANGE' event has been deprecated in favour
of separate `MAP_IN' and `MAP_OUT' events; this change reduces
space wastage in the log.

Bump the hwpmc's ABI version to "2.0.00". Teach hwpmc(4) to
handle the map change callbacks.

Change the default per-cpu sample buffer size to hold
32 samples (up from 16).

Increment __FreeBSD_version.

libpmc(3) changes:

Update libpmc(3) to deal with the new events in the log file; bring
the pmclog(3) manual page in sync with the code.

pmcstat(8) changes:

Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
(mapfile name), "-q"/"-v" (verbosity control). Option "-k" now
takes a kernel directory as its argument but will also work with
the older invocation syntax.

Rework string handling in pmcstat(8) to use an opaque type for
interned strings. Clean up ELF parsing code and add support for
tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).

Report statistics at the end of a log conversion run depending
on the requested verbosity level.

Reviewed by: jhb, dds (kernel parts of an earlier patch)
Tested by: gallatin (earlier patch)


156420 08-Mar-2006 imp

Remove leading __ from __(inline|const|signed|volatile). They are
obsolete. This should reduce diffs to NetBSD as well.


156415 08-Mar-2006 tegge

Ignore dirty pages owned by "dead" objects.


156225 02-Mar-2006 tegge

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


156224 02-Mar-2006 tegge

Hold extra reference to vm object while cleaning pages.


155884 21-Feb-2006 jhb

Lock the vm_object while checking its type to see if it is a vnode-backed
object that requires Giant in vm_object_deallocate(). This is somewhat
hairy in that if we can't obtain Giant directly, we have to drop the
object lock, then lock Giant, then relock the object lock and verify that
we still need Giant. If we don't (because the object changed to OBJT_DEAD
for example), then we drop Giant before continuing.

Reviewed by: alc
Tested by: kris


155790 17-Feb-2006 tegge

Expand scope of marker to reduce the number of page queue scan restarts.


155784 17-Feb-2006 tegge

Check return value from nonblocking call to vn_start_write().


155737 15-Feb-2006 ups

When the VM needs to allocate physical memory pages (for non interrupt use)
and it does not have plenty of free pages, it tries to free pages in the cache queue.
Unfortunately freeing a cached page requires the locking of the object that
owns the page. However in the context of allocating pages we may not be able
to lock the object and thus can only TRY to lock the object. If the locking try
fails the cache page can not be freed and is activated to move it out of the way
so that we may try to free other cache pages.

If all pages in the cache belong to objects that are currently locked the
cache queue can be emptied without freeing a single page. This scenario caused
two problems:

1) vm_page_alloc always failed allocation when it tried freeing pages from
the cache queue and failed to do so. However if there are more than
cnt.v_interrupt_free_min pages on the free list it should return pages
when requested with priority VM_ALLOC_SYSTEM. Failure to do so can cause
resource exhaustion deadlocks.

2) Threads that need to allocate pages spend a lot of time cleaning up the
page queue without really getting anything done while the pagedaemon
needs to work overtime to refill the cache.

This change fixes the first problem. (1)

Reviewed by: tegge@
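The rule behind the first problem can be sketched as a threshold check per allocation class. Only the VM_ALLOC_SYSTEM threshold is taken from the text above; the thresholds shown for the other two classes, and all sk_ names, are assumptions made for illustration.

```c
enum sk_alloc_class {
	SK_ALLOC_NORMAL,
	SK_ALLOC_SYSTEM,
	SK_ALLOC_INTERRUPT
};

struct sk_vmcnt {
	unsigned	free_count;		/* pages on the free list */
	unsigned	free_reserved;		/* assumed floor for normal requests */
	unsigned	interrupt_free_min;	/* floor for system requests */
};

/* Decide whether a request of the given class may take a free page. */
static int
sk_may_allocate(const struct sk_vmcnt *c, enum sk_alloc_class cls)
{
	switch (cls) {
	case SK_ALLOC_NORMAL:
		return (c->free_count > c->free_reserved);
	case SK_ALLOC_SYSTEM:
		return (c->free_count > c->interrupt_free_min);
	case SK_ALLOC_INTERRUPT:
		return (c->free_count > 0);
	}
	return (0);
}
```

The point of the fix is that a failed attempt to reclaim cache pages should not by itself fail a system-class request that this check would allow.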


155551 11-Feb-2006 rwatson

Skip per-cpu caches associated with absent CPUs when generating a
memory statistics record stream via sysctl.

MFC after: 3 days


155384 06-Feb-2006 jeff

- Fix silly VI locking that is used to check a single flag. The vnode
lock also protects this flag so it is not necessary.
- Don't rely on v_mount to detect whether or not we've been recycled, use
the more appropriate VI_DOOMED instead.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


155320 04-Feb-2006 alc

Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.


155230 02-Feb-2006 tegge

Adjust old comment (present in rev 1.1) to match changes in rev 1.82.

PR: kern/92509
Submitted by: "Bryan Venteicher" <bryanv@daemoninthecloset.org>


155177 01-Feb-2006 yar

Use off_t for file size passed to vnode_create_vobject().
The former type, size_t, was causing truncation to 32 bits on i386,
which immediately led to undersizing of VM objects backed by
files >4GB. In particular, sendfile(2) was broken for such files.

PR: kern/92243
MFC after: 5 days
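The truncation is easy to reproduce in user space by modelling the two types with fixed-width integers. In this sketch uint32_t stands in for size_t on i386 and int64_t for off_t; the function names and the 4K page size are assumptions.

```c
#include <stdint.h>

#define	SK_PAGE_SIZE	4096u

/* Object size in pages when the file size arrives as a 32-bit value. */
static uint64_t
sk_pages_from_size32(uint32_t size)
{
	return (((uint64_t)size + SK_PAGE_SIZE - 1) / SK_PAGE_SIZE);
}

/* Object size in pages when the file size arrives as a 64-bit value. */
static uint64_t
sk_pages_from_off64(int64_t size)
{
	return (((uint64_t)size + SK_PAGE_SIZE - 1) / SK_PAGE_SIZE);
}
```

For a 5GB file, the 32-bit path only sees the low 32 bits of the size (1GB) and sizes the object for 262144 pages instead of 1310720.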


155169 01-Feb-2006 jeff

- Install a temporary bandaid in vm_object_reference() that will stop
mtx_assert()s from triggering until I find a real long-term solution.


155128 31-Jan-2006 alc

Change #if defined(DIAGNOSTIC) to KASSERT.


155086 31-Jan-2006 pjd

Add buffer corruption protection (RedZone) for kernel's malloc(9).
It detects both: buffer underflows and buffer overflows bugs at runtime
(on free(9) and realloc(9)) and prints backtraces from where memory was
allocated and from where it was freed.

Tested by: kris


154989 29-Jan-2006 scottl

The change a few years ago of having contigmalloc start its scan at the top
of physical RAM instead of the bottom was a sound idea, but the implementation
left a lot to be desired. Scans would spend considerable time looking at
pages that are above the address range given by the caller, and multiple
calls (like what happens in busdma) would spend more time on top of that
rescanning the same pages over and over.

Solve this, at least for now, with two simple optimizations. The first is
to not bother scanning high ordered pages that are outside of the provided
address range. Second is to cache the page index from the last successful
operation so that subsequent scans don't have to restart from the top. This
is conditional on the numpages argument being the same or greater between
calls.

MFC After: 2 weeks


154934 27-Jan-2006 jhb

Add a new macro wrapper WITNESS_CHECK() around the witness_warn() function.
The difference between WITNESS_CHECK() and WITNESS_WARN() is that
WITNESS_CHECK() should be used in the places that the return value of
witness_warn() is checked, whereas WITNESS_WARN() should be used in places
where the return value is ignored. Specifically, in a kernel without
WITNESS enabled, WITNESS_WARN() evaluates to an empty string whereas
WITNESS_CHECK evaluates to 0. I also updated the one place that was
checking the return value of WITNESS_WARN() to use WITNESS_CHECK.
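The distinction matters because a macro that expands to nothing cannot appear inside an expression. A sketch of the disabled-case definitions follows, using hypothetical SK_ names rather than the real macros.

```c
/* Disabled-case stand-ins; the real macros take different arguments. */
#define	SK_WITNESS_WARN(...)			/* expands to nothing */
#define	SK_WITNESS_CHECK(...)	0		/* usable in an expression */

static int
sk_safe_to_sleep(void)
{
	/* Fine as a statement: only a lone semicolon is left behind. */
	SK_WITNESS_WARN("about to sleep");

	/* The result is tested, so the macro must yield a value. */
	if (SK_WITNESS_CHECK("about to sleep") != 0)
		return (0);
	return (1);
}
```

Writing the if statement with SK_WITNESS_WARN() instead would leave "if ( != 0)" after preprocessing, which does not compile.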


154929 27-Jan-2006 cognet

Make sure b_vp and b_bufobj are NULL before calling relpbuf(), as it asserts
they are. They should be NULL at this point, except if we're coming from
swapdev_strategy().
It should only affect the case where we're swapping directly on a file over
NFS.


154927 27-Jan-2006 alc

Style: Add blank line after local variable declarations.


154896 27-Jan-2006 alc

Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)


154889 27-Jan-2006 alc

Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)


154849 26-Jan-2006 alc

Plug a leak in the newer contigmalloc() implementation. Specifically, if
a multipage allocation was aborted midway, the pages that were already
allocated were not always returned to the free list.

Submitted by: tegge


154805 25-Jan-2006 jeff

- Avoid calling vm_object_backing_scan() when collapsing an object when
the resident page count matches the object size. We know it fully backs
its parent in this case.

Reviewed by: acl, tegge
Sponsored by: Isilon Systems, Inc.


154799 25-Jan-2006 alc

The previous revision incorrectly changed a switch statement into an if
statement. Specifically, a break statement that previously broke out of
the enclosing switch was not changed. Consequently, the enclosing loop
terminated prematurely.

This could result in "vm_page_insert: page already inserted" panics.

Submitted by: tegge
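The pitfall is general to C and can be shown with a short sketch (illustrative code only, unrelated to the actual function that was fixed). Inside a switch, break leaves the switch; once the switch becomes an if, the same break leaves the enclosing loop.

```c
/* Original shape: 'break' leaves the switch and the loop continues. */
static int
sk_count_with_switch(const int *v, int n)
{
	int i, visited = 0;

	for (i = 0; i < n; i++) {
		switch (v[i]) {
		case 0:
			break;		/* leaves the switch only */
		default:
			break;
		}
		visited++;
	}
	return (visited);
}

/* After a careless conversion: the same 'break' now ends the loop. */
static int
sk_count_with_if(const int *v, int n)
{
	int i, visited = 0;

	for (i = 0; i < n; i++) {
		if (v[i] == 0)
			break;		/* leaves the for loop */
		visited++;
	}
	return (visited);
}
```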


154788 24-Jan-2006 alc

With the recent changes to the implementation of page coloring, the
option PQ_NOOPT is used exclusively by vm_pageq.c. Thus, the
include of opt_vmpage.h can be removed from vm_page.h.


154764 24-Jan-2006 alc

In vm_page_set_invalid() invalidate all of the page's mappings as soon as
any part of the page's contents is invalidated.

Submitted by: tegge


154694 22-Jan-2006 alc

Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.


154076 06-Jan-2006 jhb

Reduce the scope of one #ifdef to avoid duplicating a SYSCTL_INT() macro
and trim another unneeded #ifdef (it was just around a macro that is
already conditionally defined).


154035 04-Jan-2006 netchild

Convert the PAGE_SIZE check into a CTASSERT.

Suggested by: jhb


154031 04-Jan-2006 netchild

Prevent divide by zero, use default values in case one of the divisors
is zero.

Tested by: Randy Bush <randy@psg.com>


153940 31-Dec-2005 netchild

MI changes:
- provide an interface (macros) to the page coloring part of the VM system;
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


153880 30-Dec-2005 pjd

Improve memguard a bit:
- Provide tunable vm.memguard.desc, so one can specify memory type without
changing the code and recompiling the kernel.
- Allow the use of memguard for kernel modules by providing sysctl
vm.memguard.desc, which can be set to the short description of the memory
type before the module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
doesn't exist (will be loaded later) or when no memory is allocated yet.
If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of comparing memory types and make the safer/slower
one the default.


153555 20-Dec-2005 tegge

Don't access fs->first_object after dropping reference to it.
The result could be a missed or extra giant unlock.

Reviewed by: alc


153485 16-Dec-2005 alc

Use sf_buf_alloc() instead of vm_map_find() on exec_map to create the
ephemeral mappings that are used as the source for three copy
operations from kernel space to user space. There are two reasons for
making this change: (1) Under heavy load exec_map can fill up causing
vm_map_find() to fail. When it fails, the nascent process is aborted
(SIGABRT). Whereas, this reimplementation using sf_buf_alloc()
sleeps. (2) Although it is possible to sleep on vm_map_find()'s
failure until address space becomes available (see kmem_alloc_wait()),
using sf_buf_alloc() is faster. Furthermore, the reimplementation
uses a CPU private mapping, avoiding a TLB shootdown on
multiprocessors.

Problem uncovered by: kris@
Reviewed by: tegge@
MFC after: 3 weeks


153385 13-Dec-2005 alc

Assert that the page that is given to vm_page_free_toq() does not have any
managed mappings.


153311 11-Dec-2005 alc

Remove unneeded calls to pmap_remove_all(). The given page is not mapped.

Reviewed by: tegge


153095 04-Dec-2005 alc

Simplify vmspace_dofree().


153068 03-Dec-2005 alc

Eliminate unneeded preallocation at initialization.

Reviewed by: tegge


153060 03-Dec-2005 alc

Eliminate unneeded preallocation at initialization.

Reviewed by: tegge


152630 20-Nov-2005 alc

Eliminate pmap_init2(). It's no longer used.


152224 09-Nov-2005 alc

Reimplement the reclamation of PV entries. Specifically, perform
reclamation synchronously from get_pv_entry() instead of
asynchronously as part of the page daemon. Additionally, limit the
reclamation to inactive pages unless allocation from the PV entry zone
or reclamation from the inactive queue fails. Previously, reclamation
destroyed mappings to both inactive and active pages. get_pv_entry()
still, however, wakes up the page daemon when reclamation occurs. The
reason being that the page daemon may move some pages from the active
queue to the inactive queue, making some new pages available to future
reclamations.

Print the "reclaiming PV entries" message at most once per minute, but
don't stop printing it after the fifth time. This way, we do not give
the impression that the problem has gone away.

Reviewed by: tegge


152178 08-Nov-2005 alc

If a physical page is mapped by two or more virtual addresses, transmitted
by the zero-copy sockets method, and written to before the transmission
completes, we need to destroy all of the existing mappings to the page,
not just the one that we fault on. Otherwise, the mappings will no longer
be to the same page and changes made through one of the mappings will not
be visible through the others.

Observed by: tegge


151951 01-Nov-2005 ps

Rate limit vnode_pager_putpages printfs to once a second.
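
As an illustration of the once-a-second limiting mentioned above, here is a minimal user-space sketch. It is not the code from this commit: the structure, the function name, and the choice of a time_t timestamp are all assumptions made for the example.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical limiter state; zero-initialize before first use. */
struct sketch_ratelimit {
	bool	armed;	/* true once a message has been allowed */
	time_t	last;	/* second in which the last message was allowed */
};

/*
 * Returns true if a message may be printed at time "now", allowing at
 * most one message per wall-clock second.
 */
bool
sketch_ratecheck(struct sketch_ratelimit *rl, time_t now)
{
	if (rl->armed && now == rl->last)
		return (false);		/* already printed during this second */
	rl->armed = true;
	rl->last = now;
	return (true);
}
```

A caller would wrap its printf in a check such as sketch_ratecheck(&rl, time(NULL)); passing the time in explicitly keeps the sketch easy to test.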


151918 01-Nov-2005 alc

Consider the zero-copy transmission of a page that was wired by mlock(2).
If a copy-on-write fault occurs on the page, the new copy should inherit
a part of the original page's wire count.

Submitted by: tegge
MFC after: 1 week


151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


151558 22-Oct-2005 alc

Use of the ZERO_COPY_SOCKETS options can result in an unusual state that
vm_object_backing_scan() was not written to handle. Specifically, a wired
page within a backing object that is shadowed by a page within the shadow
object. Handle this state by removing the wired page from the backing
object. The wired page will be freed by socow_iodone().

Stop masking errors: If a page is being freed by vm_object_backing_scan(),
assert that it is no longer mapped rather than quietly destroying any
mappings.

Tested by: Harald Schmalzbauer


151526 20-Oct-2005 rwatson

Change format string for u_int64_t to %ju from %llu, in order to use the
correct format string on 64-bit systems.

Pointed out by: pjd
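
For readers unfamiliar with the %ju conversion, the sketch below shows the usual portable pattern in standard C. The function name is hypothetical, and the cast to uintmax_t is an assumption about how the argument is passed; the entry itself only states the format string change.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Formats a 64-bit counter into buf and returns snprintf's result.
 * %ju expects a uintmax_t argument, so the cast keeps the format and
 * the argument in agreement on both 32-bit and 64-bit platforms.
 */
int
sketch_format_counter(char *buf, size_t len, uint64_t value)
{
	return (snprintf(buf, len, "%ju", (uintmax_t)value));
}
```

By contrast, %llu requires an unsigned long long argument, which is not guaranteed to be the underlying type of a 64-bit typedef on every platform.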


151516 20-Oct-2005 rwatson

Add a "show uma" command to DDB, which prints out the current stats for
available UMA zones. Quite useful for post-mortem debugging of memory
leaks without a dump device configured on a panicked box.

MFC after: 2 weeks


151252 12-Oct-2005 dds

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


151104 08-Oct-2005 des

As alc pointed out to me, vm_page.c 1.305 was incomplete: uma_startup()
still uses the constant UMA_BOOT_PAGES. Change it to accept boot_pages
as an additional argument.

MFC after: 2 weeks


150926 04-Oct-2005 dds

Update the vnode's access time after an mmap operation on it.
Before this change a copy operation with cp(1) would not update the
file access times.

According to the POSIX mmap(2) documentation: the st_atime field
of the mapped file may be marked for update at any time between the
mmap() call and the corresponding munmap() call. The initial read
or write reference to a mapped region shall cause the file's st_atime
field to be marked for update if it has not already been marked for
update.


150727 29-Sep-2005 jhb

Trim a couple of unneeded includes.


150418 21-Sep-2005 cognet

Make sure we have a bufobj before calling bstrategy().
I'm not sure this is the right thing to do, but at least I don't panic
anymore when swapping on a NFS file without using md(4).

X-MFC after: proper review


150397 20-Sep-2005 peter

Remove unused (but initialized) variable 'objsize' from vm_mmap()


149900 09-Sep-2005 alc

Introduce a new lock for the purpose of synchronizing access to the
UMA boot pages.

Disable recursion on the general UMA lock now that startup_alloc() no
longer uses it.

Eliminate the variable uma_boot_free. It serves no purpose.

Note: This change eliminates a lock-order reversal between a system
map mutex and the UMA lock. See
http://sources.zabbadoz.net/freebsd/lor.html#109 for details.

MFC after: 3 days


149839 07-Sep-2005 alc

Eliminate an incorrect cast.


149768 03-Sep-2005 alc

Pass a value of type vm_prot_t to pmap_enter_quick() so that it can determine
whether the mapping should permit execute access.


149035 13-Aug-2005 kan

Do not use vm_pager_init() to initialize vnode_pbuf_freecnt variable.
vm_pager_init() is run before required nswbuf variable has been set
to correct value. This caused system to run with single pbuf available
for vnode_pager. Handle both cluster_pbuf_freecnt and vnode_pbuf_freecnt
variable in the same way.

Reported by: ade
Obtained from: alc
MFC after: 2 days


148997 12-Aug-2005 tegge

Check for marker pages when scanning active and inactive page queues.

Reviewed by: alc


148985 12-Aug-2005 des

Introduce the vm.boot_pages tunable and sysctl, which controls the number
of pages reserved to bootstrap the kernel memory allocator.

MFC after: 2 weeks


148909 10-Aug-2005 tegge

Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE
due to the vm object being locked.

When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine. If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.

Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation. Detect and handle relevant page queue changes.

This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.

Reviewed by: alc


148875 08-Aug-2005 ssouhlal

Use atomic operations on runningbufspace.

PR: kern/84318
Submitted by: ade
MFC after: 3 days


148691 04-Aug-2005 rwatson

Don't perform a nested include of opt_vmpage.h if LIBMEMSTAT is defined,
as opt_vmpage.h will not be available to user space library builds. A
similar existing check is present for KLD_MODULE for similar reasons.

MFC after: 3 days


148690 04-Aug-2005 rwatson

Wrap inlines in uma_int.h in #ifdef _KERNEL so that uma_int.h can be
used from memstat_uma.c for the purposes of kvm access without lots
of additional unsafe includes.

MFC after: 3 days


148371 25-Jul-2005 rwatson

Rename UMA_MAX_NAME to UTH_MAX_NAME, since it's a maximum in the
monitoring API, which might or might not be the same as the internal
maximum (currently none).

Export flag information on UMA zones -- in particular, whether or
not this is a secondary zone, and so the keg free count should be
considered in that light.

MFC after: 1 day


148200 20-Jul-2005 alc

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that the operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


148194 20-Jul-2005 rwatson

Further UMA statistics related changes:

- Add a new uma_zfree_internal() flag, ZFREE_STATFREE, which causes it
to update the zone's uz_frees statistic. Previously, the statistic was
updated unconditionally.

- Use the flag in situations where a "real" free occurs: i.e., one where
the caller is freeing an allocated item, to be differentiated from
situations where uma_zfree_internal() is used to tear down the item
during slab teardown in order to invoke its fini() method. Also use
the flag when UMA is freeing its internal objects.

- When exchanging a bucket with the zone from the per-CPU cache when
freeing an item, flush cache statistics back to the zone (since the
zone lock and critical section are both held) to match the allocation
case.

MFC after: 3 days


148193 20-Jul-2005 alc

Eliminate an incorrect (and unnecessary) cast.


148079 16-Jul-2005 rwatson

Use mp_maxid in preference to MAXCPU when creating exports of UMA
per-CPU cache statistics. UMA sizes the cache array based on the
number of CPUs at boot (mp_maxid + 1), and iterating based on MAXCPU
could read off the end of the array (into the next zone).

Reported by: yongari
MFC after: 1 week
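
The overrun described above can be pictured with a small user-space sketch. The structure and names below are hypothetical stand-ins, not UMA's own definitions; the point is only that the loop bound must match the number of elements actually allocated (maxid + 1), not a larger compile-time maximum.

```c
#include <stdint.h>

#define	SKETCH_MAXCPU	16	/* hypothetical compile-time maximum */

/* Hypothetical per-CPU cache record. */
struct sketch_cache {
	uint64_t	allocs;
};

/*
 * Sums the counters of an array that holds maxid + 1 entries.
 * Iterating up to SKETCH_MAXCPU instead would read past the end of the
 * array whenever maxid + 1 is smaller than SKETCH_MAXCPU.
 */
uint64_t
sketch_sum_allocs(const struct sketch_cache *caches, unsigned int maxid)
{
	uint64_t total = 0;

	for (unsigned int cpu = 0; cpu <= maxid; cpu++)
		total += caches[cpu].allocs;
	return (total);
}
```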


148078 16-Jul-2005 rwatson

Improve canonicalization of copyrights. Order copyrights by order of
assertion (jeff, bmilekic, rwatson).

Suggested ages ago by: bde
MFC after: 1 week


148077 16-Jul-2005 rwatson

Move the unlocking of the zone mutex in sysctl_vm_zone_stats() so that
it covers the following of the uc_alloc/freebucket cache pointers.
Originally, I felt that the race wasn't helped by holding the mutex,
hence a comment in the code and not holding it across the cache access.
However, it does improve consistency, as while it doesn't prevent
bucket exchange, it does prevent bucket pointer invalidation. So a
race in gathering cache free space statistics still can occur, but not
one that follows an invalid bucket pointer, if the mutex is held.

Submitted by: yongari
MFC after: 1 week


148072 16-Jul-2005 silby

Increase the flags field for kegs from a 16 to a 32 bit value;
we have exhausted all 16 flags.


148070 15-Jul-2005 rwatson

Track UMA(9) allocation failures by zone, and export via sysctl.

Requested by: victor cruceru <victor dot cruceru at gmail dot com>
MFC after: 1 week


148014 14-Jul-2005 jhb

Convert a remaining !fs.map->system_map to
fs.first_object->flags & OBJ_NEEDGIANT test that was missed in an earlier
revision. This fixes mutex assertion failures in the debug.mpsafevm=0
case.

Reported by: ps
MFC after: 3 days


147996 14-Jul-2005 rwatson

Introduce a new sysctl, vm.zone_stats, which exports UMA(9) allocator
statistics via a binary structure stream:

- Add structure 'uma_stream_header', which defines a stream version,
definition of MAXCPUs used in the stream, and the number of zone
records in the stream.

- Add structure 'uma_type_header', which defines the name, alignment,
size, resource allocation limits, current pages allocated, preferred
bucket size, and central zone + keg statistics.

- Add structure 'uma_percpu_stat', which, for each per-CPU cache,
includes the number of allocations and frees, as well as the number
of free items in the cache.

- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUs uma_percpu_stat structures holding
per-CPU allocation information. Typical values of MAXCPU will be
1 (UP compiled kernel) and 16 (SMP compiled kernel).

This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.

While here, also export the number of UMA zones as a sysctl
vm.uma_count, in order to assist in sizing user space buffers to
receive the stream.

A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. This change directly
supports converting netstat(1)'s "-mb" mode to using UMA-sourced stats
rather than separately maintained mbuf allocator statistics.

MFC after: 1 week
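
The stream layout described above (one stream header, then for each zone a type header followed by MAXCPU per-CPU records) implies a simple size calculation for a receiving buffer. The sketch below is an assumption-laden illustration: the function name is hypothetical and the structure sizes are passed in as parameters because the real layouts are not reproduced here.

```c
#include <stddef.h>

/*
 * Bytes needed for one stream: a stream header, then for each zone a
 * type header followed by maxcpus per-CPU records.  All sizes are
 * supplied by the caller; none are taken from the real structures.
 */
size_t
sketch_stream_size(size_t stream_hdr, size_t type_hdr, size_t percpu,
    size_t zones, size_t maxcpus)
{
	return (stream_hdr + zones * (type_hdr + maxcpus * percpu));
}
```

A monitoring tool could take the zone count from vm.uma_count and the three sizes from sizeof on the corresponding structures.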


147995 14-Jul-2005 rwatson

In addition to tracking allocs in the zone, also track frees. Add
a zone free counter, as well as a cache free counter.

MFC after: 1 week


147994 14-Jul-2005 rwatson

In an earlier world order, UMA would flush per-CPU statistics to the
zone whenever it was moving buckets between the zone and the cache,
or when coalescing statistics across the CPU. Remove flushing of
statistics to the zone when coalescing statistics as part of sysctl,
as we won't be running on the right CPU to write to the cache
statistics.

Add a missed gathering of statistics: when uma_zalloc_internal()
does a special case allocation of a single item, make sure to update
the zone statistics to represent this. Previously this case wasn't
accounted for in user-visible statistics.

MFC after: 1 week


147615 26-Jun-2005 silby

Change the panic in trash_ctor into just a printf for now. Once the reports
of panics in trash_ctor relating to mbufs have been examined and a fix
found, this will be turned back into a panic.

Approved by: re (rwatson)


147422 16-Jun-2005 alc

Increase UMA_BOOT_PAGES to prevent a crash during initialization. See
http://docs.FreeBSD.org/cgi/mid.cgi?42AD8270.8060906 for a detailed
description of the crash.

Reported by: Eric Anderson
Approved by: re (scottl)
MFC after: 3 days


147283 11-Jun-2005 green

The new contigmalloc(9) has a bad degenerate case where there were
many regions checked again and again despite knowing the pages
contained were not usable and only satisfied the alignment constraints.
This case was compounded, especially for large allocations, by the
practice of looping from the top of memory so as to keep out of the
important low-memory regions. While the old contigmalloc(9) has the
same problem, it is not as noticeable due to looping from the low
memory to high.

This degenerate case is fixed, as well as reversing the sense of the
rest of the loops within it, to provide a tremendous speed increase.
This makes the best case O(n * VM overhead) much more likely than the
worst case O(4 * VM overhead). For comparison, the worst case for old
contigmalloc would be O(5 * VM overhead) in addition to its strategy
of turning used memory into free being highly pessimal.

Also, fix a bug that in practice most likely couldn't have been triggered,
in the new contigmalloc(9): it walked backwards from the end of memory
without accounting for how many pages it needed. Potentially, nonexistent
pages could have been mapped. This hasn't occurred because the kernel
generally requests as its first contigmalloc(9) a single page.

Reported by: Nicolas Dehaine <nicko@stbernard.com>, wes
MFC After: 1 month
More testing by: Nicolas Dehaine <nicko@stbernard.com>, wes


147262 10-Jun-2005 alc

Add a comment to the effect that fictitious pages do not require the
initialization of their machine-dependent fields.


147217 10-Jun-2005 alc

Introduce a procedure, pmap_page_init(), that initializes the
vm_page's machine-dependent fields. Use this function in
vm_pageq_add_new_page() so that the vm_page's machine-dependent and
machine-independent fields are initialized at the same time.

Remove code from pmap_init() for initializing the vm_page's
machine-dependent fields.

Remove stale comments from pmap_init().

Eliminate the Boolean variable pmap_initialized from the alpha, amd64,
i386, and ia64 pmap implementations. Its use is no longer required
because of the above changes and earlier changes that result in physical
memory that is being mapped at initialization time being mapped without
pv entries.

Tested by: cognet, kensmith, marcel


146727 28-May-2005 alc

Update some comments to reflect the change from spl-based to lock-based
synchronization.


146554 23-May-2005 ups

Use low level constructs borrowed from interrupt threads to wait for
work in proc0.
Remove the TDP_WAKEPROC0 workaround.


146501 22-May-2005 alc

Swap in can occur safely without Giant. Release Giant on entry to
scheduler().


146484 22-May-2005 alc

Remove GIANT_REQUIRED from swapout_procs().


146459 20-May-2005 alc

Reduce the number of times that we acquire and release locks in
swap_pager_getpages().

MFC after: 1 week


146367 19-May-2005 alc

Remove calls to spl*().


146363 19-May-2005 alc

Remove a stale comment concerning spl* usage.


146355 18-May-2005 alc

Update some comments to reflect the change from spl-based to lock-based
synchronization.


146351 18-May-2005 alc

Remove calls to spl*().


146350 18-May-2005 alc

Revert revision 1.270: swp_pager_async_iodone() need not perform
VM_LOCK_GIANT().

Discussed with: jeff


146340 18-May-2005 bz

Correct 32 vs 64 bit signedness issues.

Approved by: pjd (mentor)
MFC after: 2 weeks


146126 12-May-2005 grehan

The final test in unlock_and_deallocate() to determine if GIANT needs to be
unlocked wasn't updated to check for OBJ_NEEDGIANT. This caused a WITNESS
panic when debug_mpsafevm was set to 0.

Approved by: jeffr


146017 08-May-2005 marcel

Enable debug_mpsafevm on ia64 due to the severe functional regression
caused by recent locking changes when it's off. Revert the logic to
trim down the conditional.

Clued-in by: alc@


145888 04-May-2005 jeff

- We need to inherit the OBJ_NEEDGIANT flag from the original object in
vm_object_split().

Spotted by: alc


145826 03-May-2005 jeff

- Add a new object flag "OBJ_NEEDGIANT". We set this flag if the
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has OBJ_NEEDGIANT
set.
- In vm_object_shadow inherit the OBJ_NEEDGIANT flag from the backing object.


145788 02-May-2005 alc

Remove GIANT_REQUIRED from vmspace_exec().

Prodded by: jeff


145699 30-Apr-2005 jeff

- VM_LOCK_GIANT in the swap pager's iodone routine as VFS will soon call it
without Giant.

Sponsored by: Isilon Systems, Inc.


145686 29-Apr-2005 rwatson

Modify UMA to use critical sections to protect per-CPU caches, rather than
mutexes, which offers lower overhead on both UP and SMP. When allocating
from or freeing to the per-cpu cache, without INVARIANTS enabled, we now
no longer perform any mutex operations, which offers a 1%-3% performance
improvement in a variety of micro-benchmarks. We rely on critical
sections to prevent (a) preemption resulting in reentrant access to UMA on
a single CPU, and (b) migration of the thread during access. In the event
we need to go back to the zone for a new bucket, we release the critical
section to acquire the global zone mutex, and must re-acquire the critical
section and re-evaluate which cache we are accessing in case migration has
occurred, or circumstances have changed in the current cache.

Per-CPU cache statistics are now gathered lock-free by the sysctl, which
can result in small races in statistics reporting for caches.

Reviewed by: bmilekic, jeff (somewhat)
Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others


145584 27-Apr-2005 jeff

- Pass the ISOPEN flag to namei so filesystems will know we're about to
open them or otherwise access the data.


145530 25-Apr-2005 kris

Add the vm.exec_map_entries tunable and read-only sysctl, which controls
the number of entries in exec_map (maximum number of simultaneous execs
that can be handled by the kernel). The default value of 16 is
insufficient on heavily loaded machines (particularly SMP machines), and
if it is exceeded then executing further processes will generate a SIGABRT.

This is a workaround until a better solution can be implemented.

Reviewed by: alc
MFC after: 3 days


145144 16-Apr-2005 des

Unbreak the build on 64-bit architectures.


145127 15-Apr-2005 jhb

Add a vm.blacklist tunable which can hold a space- or comma-separated list
of physical addresses. The pages containing these physical addresses will
not be added to the free list and thus will effectively be ignored by the
VM system. This is mostly useful for the case when one knows of specific
physical addresses that have bit errors (such as from a memtest run) so
that one can blacklist the bad pages while waiting for the new sticks of
RAM to arrive. The physical addresses of any ignored pages are listed in
the message buffer as well.
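
To make the tunable's format concrete, the following user-space sketch parses a space- or comma-separated list of physical addresses and records the base of the page containing each one. It is not the kernel's parser: the function name, the 4096-byte page size, and the use of strtoull() are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define	SKETCH_PAGE_SIZE	4096ULL		/* assumed page size */

/*
 * Parses up to max addresses from list and stores the base address of
 * the page containing each one in pages[].  Returns the number stored.
 */
size_t
sketch_parse_blacklist(const char *list, uint64_t *pages, size_t max)
{
	const char *p = list;
	char *end;
	size_t n = 0;

	assert(list != NULL && pages != NULL);
	while (*p != '\0' && n < max) {
		if (*p == ' ' || *p == ',') {
			p++;			/* skip separators */
			continue;
		}
		uint64_t pa = strtoull(p, &end, 0);
		if (end == p)
			break;			/* not a number: stop parsing */
		pages[n++] = pa & ~(SKETCH_PAGE_SIZE - 1);
		p = end;
	}
	return (n);
}
```

For example, the string "0x1234, 8192 0x3fff" yields the page bases 0x1000, 0x2000, and 0x3000.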


145076 14-Apr-2005 csjp

Move MAC check_vnode_mmap entry point out from being exclusive to
MAP_SHARED so that the entry point gets executed unconditionally.
This may be useful for security policies which want to perform access
control checks around run-time linking.

- Add the mmap(2) flags argument to the check_vnode_mmap entry point
so that we can make access control decisions based on the type of
mapped object.
- Update any dependent API around this parameter addition such as
function prototype modifications, entry point parameter additions
and the inclusion of sys/mman.h header file.
- Change the MLS, BIBA and LOMAC security policies so that subject
domination routines are not executed unless the type of mapping is
shared. This is done to maintain compatibility between the old
vm_mmap_vnode(9) and these policies.

Reviewed by: rwatson
MFC after: 1 month


144970 12-Apr-2005 jhb

Tidy vcnt() by moving a duplicated line above #ifdef and removing a useless
variable.


144635 04-Apr-2005 jhb

Flip the switch and turn mpsafevm on by default for sparc64.

Approved by: alc


144610 03-Apr-2005 jeff

- Don't NULL the vnode's v_object pointer until after the object is torn
down. If we have dirty pages, the putpages routine will need to know
what the vnode's object is so that it may write out dirty pages.

Pointy hat: phk
Found by: obrien


144501 01-Apr-2005 jhb

- Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmaping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.

Reviewed by: alc, arch@


144367 31-Mar-2005 jeff

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


144322 30-Mar-2005 alc

Eliminate (now) unnecessary acquisition and release of the global page
queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY
flag and busy field are synchronized by the containing object's lock.

Testing the page's hold_count and wire_count in vm_object_backing_scan()'s
OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held
or wired pages cannot be migrated to the shadow object.

Reviewed by: tegge


143821 18-Mar-2005 das

Move the swap_zone == NULL check earlier (i.e. before we dereference
the pointer.)

Found by: Coverity Prevent analysis tool


143745 17-Mar-2005 jeff

- Don't lock the vnode interlock in vm_object_set_writeable_dirty() if
we've already set the object flags.

Reviewed by: alc


143646 15-Mar-2005 jeff

- In vm_page_insert() hold the backing vnode when the first page
is inserted.
- In vm_page_remove() drop the backing vnode when the last page
is removed.
- Don't check the vnode to see if it must be reclaimed on every
call to vm_page_free_toq() as we only check it now when it is
actually required. This saves us two lock operations per call.

Sponsored by: Isilon Systems, Inc.


143559 14-Mar-2005 jeff

- Don't directly adjust v_usecount, use vref() instead.

Sponsored by: Isilon Systems, Inc.


143554 14-Mar-2005 jeff

- Retire OLOCK and OWANT. All callers hold the vnode lock when creating
a vnode object. There has been an assert to prove this for some time.

Sponsored by: Isilon Systems, Inc.


143505 13-Mar-2005 jeff

- Don't acquire the vnode lock in destroy_vobject, assert that it has
already been acquired by the caller.

Sponsored by: Isilon Systems, Inc.


142367 24-Feb-2005 alc

Revert the first part of revision 1.114 and modify the second part. On
architectures implementing uma_small_alloc() pages do not necessarily
belong to the kmem object.


142079 19-Feb-2005 phk

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


141991 16-Feb-2005 bmilekic

Well, it seems that I prematurely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


141983 16-Feb-2005 bmilekic

Make UMA set the overloaded page->object back to kmem_object for
UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly
came from kmem_map for those two. Previously it would set it back
to NULL for UMA_ZONE_REFCNT zones and although this was probably not
fatal, it added MORE code for no reason.


141955 15-Feb-2005 bmilekic

Rather than overloading the page->object field like UMA does, use instead
an unused pageq queue reference in the page structure to stash a pointer
to the MemGuard FIFO. Using the page->object field caused problems
because when vm_map_protect() was called the second time to set
VM_PROT_DEFAULT back onto a set of pages in memguard_map, the protection
in the VM would be changed but the PMAP code would lazily not restore
the PG_RW bit on the underlying pages right away (see pmap_protect()).
So when a page fault finally occurred and the VM noticed the faulting
address corresponds to a page that _does_ have write access now, it
would then call into PMAP to set back PG_RW (i386 case being discussed
here). However, before it got to do that, an assertion on the object
lock not being owned would get triggered, as the object of the faulting
page would need to be locked but was overloaded by MemGuard. This is
precisely why MemGuard cannot overload page->object.

Submitted by: Alan Cox (alc@)


141696 11-Feb-2005 phk

sysctl node vm.stats can not be static (for ia64 reasons).


141670 10-Feb-2005 bmilekic

Implement support for buffers larger than PAGE_SIZE in MemGuard. Adds
a little bit of complexity but performance requirements lacking (this is
a debugging allocator after all), it's really not too bad (still
only 317 lines).

Also add an additional check to help catch really weird 3-threads-involved
races: make memguard_free() write to the first page handed back, always,
before it does anything else.

Note that there is still a problem in VM+PMAP (specifically with
vm_map_protect) w.r.t. how MemGuard uses it, but this will be fixed shortly
and this change stands on its own.


141630 10-Feb-2005 phk

Make three SYSCTL_NODEs static


141629 10-Feb-2005 phk

Make npages static and const.


141247 04-Feb-2005 ssouhlal

Set the scheduling class of the zeroidle thread to PRI_IDLE.

Reviewed by: jhb
Approved by: grehan (mentor)
MFC after: 1 week


141068 30-Jan-2005 alc

Update the text of an assertion to reflect changes made in revision 1.148.
Submitted by: tegge

Eliminate an unnecessary, temporary increment of the backing object's
reference count in vm_object_qcollapse(). Reviewed by: tegge


140929 28-Jan-2005 phk

Move the contents of vop_stddestroyvobject() to the new vnode_pager
function vnode_destroy_vobject().

Make the new function zero the vp->v_object pointer so we can tell
if a call is missing.


140782 25-Jan-2005 phk

Don't use VOP_GETVOBJECT, use vp->v_object directly.


140767 24-Jan-2005 phk

Move the body of vop_stdcreatevobject() over to the vnode_pager under
the name Sande^H^H^H^H^Hvnode_create_vobject().

Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.

Call that new function from vop_stdcreatevobject().

Make vnode_pager_alloc() private now that its only user came home.


140734 24-Jan-2005 phk

Kill the VV_OBJBUF and test the v_object for NULL instead.


140723 24-Jan-2005 jeff

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


140622 22-Jan-2005 alc

Guard against address wrap in kernacc(). Otherwise, a program accessing a
bad address range through /dev/kmem can panic the machine.

Submitted by: Mark W. Krentel
Reported by: Kris Kennaway
MFC after: 1 week
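
The address-wrap guard described above can be sketched as follows. This is a minimal illustration, not the kernel's kernacc(): the helper name range_ok() and its lo/hi bounds are hypothetical, and the only point is that an unsigned addr + len can wrap to a small value that then passes a naive bounds check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's kernacc()): returns true only
 * when [addr, addr + len) lies inside [lo, hi) and does not wrap.
 */
static bool
range_ok(uintptr_t addr, size_t len, uintptr_t lo, uintptr_t hi)
{
	uintptr_t end = addr + len;	/* unsigned arithmetic wraps modulo 2^N */

	if (end < addr)			/* wrapped past the top of the address space */
		return (false);
	return (addr >= lo && end <= hi);
}
```

Without the end < addr test, a request that starts near the top of the address space would yield a small end value and be accepted by the second comparison.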


140605 22-Jan-2005 bmilekic

s/round_page/trunc_page/g

I meant trunc_page. It's only a coincidence this hasn't caused
problems yet.

Pointed out by: Antoine Brodin <antoine.brodin@laposte.net>
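
For readers unfamiliar with the two macros, the difference can be sketched as below. The EX_ and ex_ prefixed names and the 4096-byte page size are assumptions made for this illustration; they mirror the usual mask-based definitions but are not copied from the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative definitions; a 4096-byte page is assumed. */
#define	EX_PAGE_SIZE	((uintptr_t)4096)
#define	EX_PAGE_MASK	(EX_PAGE_SIZE - 1)

/* Round down to the start of the page that contains x. */
#define	ex_trunc_page(x)	((uintptr_t)(x) & ~EX_PAGE_MASK)

/* Round up to the next page boundary (no change if already aligned). */
#define	ex_round_page(x)	(((uintptr_t)(x) + EX_PAGE_MASK) & ~EX_PAGE_MASK)
```

The two agree only for page-aligned inputs, which is one way a mix-up like this could go unnoticed.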


140587 21-Jan-2005 bmilekic

Bring in MemGuard, a very simple and small replacement allocator
designed to help detect tamper-after-free scenarios, a problem more
and more common and likely with multithreaded kernels where race
conditions are more prevalent.

Currently MemGuard can only take over malloc()/realloc()/free() for
particular (a) malloc type(s) and the code brought in with this
change manually instruments it to take over M_SUBPROC allocations
as an example. If you are planning to use it, for now you must:

1) Put "options DEBUG_MEMGUARD" in your kernel config.
2) Edit src/sys/kern/kern_malloc.c manually, look for
"XXX CHANGEME" and replace the M_SUBPROC comparison with
the appropriate malloc type (this might require additional
but small/simple code modification if, say, the malloc type
is declared out of scope).
3) Build and install your kernel. Tune vm.memguard_divisor
boot-time tunable which is used to scale how much of kmem_map
you want to allot for MemGuard's use. The default is 10,
so kmem_size/10.

ToDo:
1) Bring in a memguard(9) man page.
2) Better instrumentation (e.g., boot-time) of MemGuard taking
over malloc types.
3) Teach UMA about MemGuard to allow MemGuard to override zone
allocations too.
4) Improve MemGuard if necessary.

This work is partly based on some old patches from Ian Dowse.


140439 18-Jan-2005 alc

Add checks to vm_map_findspace() to test for address wrap. The conditions
where this could occur are very rare, but possible.

Submitted by: Mark W. Krentel
MFC after: 2 weeks


140319 15-Jan-2005 alc

Consider three objects, O, BO, and BBO, where BO is O's backing object
and BBO is BO's backing object. Now, suppose that O and BO are being
collapsed. Furthermore, suppose that BO has been marked dead
(OBJ_DEAD) by vm_object_backing_scan() and that either
vm_object_backing_scan() has been forced to sleep due to encountering
a busy page or vm_object_collapse() has been forced to sleep due to
memory allocation in the swap pager. If vm_object_deallocate() is
then called on BBO and BO is BBO's only shadow object,
vm_object_deallocate() will collapse BO and BBO. In doing so, it adds
a necessary temporary reference to BO. If this collapse also sleeps
and the prior collapse resumes first, the temporary reference will
cause vm_object_collapse to panic with the message "backing_object %p
was somehow re-referenced during collapse!"

Resolve this race by changing vm_object_deallocate() such that it
doesn't collapse BO and BBO if BO is marked dead. Once O and BO are
collapsed, vm_object_collapse() will attempt to collapse O and BBO.
So, vm_object_deallocate() on BBO need do nothing.

Reported by: Peter Holm on 20050107
URL: http://www.holm.cc/stress/log/cons102.html

In collaboration with: tegge@
Candidate for RELENG_4 and RELENG_5
MFC after: 2 weeks


140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


140031 11-Jan-2005 bmilekic

While we want the recursion protection for the bucket zones so that
recursion from the VM is handled (and the calling code that allocates
buckets knows how to deal with it), we do not want to prevent allocation
from the slab header zones (slabzone and slabrefzone) if uk_recurse is
not zero for them. The reason is that it could lead to NULL being
returned for the slab header allocations even in the M_WAITOK
case, and the caller can't handle that (this is also explained in a
comment with this commit).

The problem analysis is documented in our mailing lists:
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current

(see entire thread for proper context).

Crash dump data provided by: Peter Holm <peter@holm.cc>


139996 10-Jan-2005 stefanf

ISO C requires at least one element in an initialiser list.


139921 08-Jan-2005 alc

Move the acquisition and release of the page queues lock outside of a loop
in vm_object_split() to avoid repeated acquisition and release.


139835 07-Jan-2005 alc

Transfer responsibility for freeing the page taken from the cache
queue and (possibly) unlocking the containing object from
vm_page_alloc() to vm_page_select_cache(). Recent optimizations to
vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and
pmap_enter_quick() have resulted in panic()s because vm_page_alloc()
mistakenly unlocked objects that had not been locked by
vm_page_select_cache().

Reported by: Peter Holm and Kris Kennaway


139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


139779 06-Jan-2005 alc

Revise the part of vm_pageout_scan() that moves pages from the cache
queue to the free queue. With this change, if a page from the cache
queue belongs to a locked object, it is simply skipped over rather
than moved to the inactive queue.


139629 03-Jan-2005 phk

When allocating bio's in the swap_pager use M_WAITOK since the
alternative is much worse.


139495 31-Dec-2004 alc

Assert that page allocations during an interrupt specify
VM_ALLOC_INTERRUPT.

Assert that pages removed from the cache queue are not busy.


139391 29-Dec-2004 alc

Access to the page's busy field is (now) synchronized by the containing
object's lock. Therefore, the assertion that the page queues lock is held
can be removed from vm_page_io_start().


139338 27-Dec-2004 alc

Note that access to the page's busy count is synchronized by the containing
object's lock.


139332 26-Dec-2004 alc

Assert that the vm object is locked on entry to vm_page_sleep_if_busy();
remove some unneeded code.


139318 26-Dec-2004 bmilekic

Add my copyright and update Jeff's copyright on UMA source files,
as per his request.

Discussed with: Jeffrey Roberson


139296 25-Dec-2004 phk

fix comment


139265 24-Dec-2004 alc

Continue the transition from synchronizing access to the page's PG_BUSY
flag and busy field with the global page queues lock to synchronizing their
access with the containing object's lock. Specifically, acquire the
containing object's lock before reading the page's PG_BUSY flag and busy
field in vm_fault().

Reviewed by: tegge@


139241 23-Dec-2004 alc

Modify pmap_enter_quick() so that it expects the page queues to be locked
on entry and it assumes the responsibility for releasing the page queues
lock if it must sleep.

Remove a bogus comment from pmap_enter_quick().

Using the first change, modify vm_map_pmap_enter() so that the page queues
lock is acquired and released once, rather than each time that a page
is mapped.


138986 17-Dec-2004 alc

Eliminate another unnecessary call to vm_page_busy(). (See revision 1.333
for a detailed explanation.)


138981 17-Dec-2004 alc

Enable debug.mpsafevm by default on alpha.


138897 15-Dec-2004 alc

In the common case, pmap_enter_quick() completes without sleeping.
In such cases, the busying of the page and the unlocking of the
containing object by vm_map_pmap_enter() and vm_fault_prefault() is
unnecessary overhead. To eliminate this overhead, this change
modifies pmap_enter_quick() so that it expects the object to be locked
on entry and it assumes the responsibility for busying the page and
unlocking the object if it must sleep. Note: alpha, amd64, i386 and
ia64 are the only implementations optimized by this change; arm,
powerpc, and sparc64 still conservatively busy the page and unlock the
object within every pmap_enter_quick() call.

Additionally, this change is the first case where we synchronize
access to the page's PG_BUSY flag and busy field using the containing
object's lock rather than the global page queues lock. (Modifications
to the page's PG_BUSY flag and busy field have asserted both locks for
several weeks, enabling an incremental transition.)


138538 08-Dec-2004 alc

With the removal of kern/uipc_jumbo.c and sys/jumbo.h,
vm_object_allocate_wait() is not used. Remove it.


138531 07-Dec-2004 alc

Almost nine years ago, when support for 1TB files was introduced in
revision 1.55, the address parameter to vnode_pager_addr() was changed
from an unsigned 32-bit quantity to a signed 64-bit quantity. However,
an out-of-range check on the address was not updated. Consequently,
memory-mapped I/O on files greater than 2GB could cause a kernel panic.
Since the address is now a signed 64-bit quantity, the problem resolution
is simply to remove a cast.

Reviewed by: bde@ and tegge@
PR: 73010
MFC after: 1 week
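
How a stale narrowing cast can break a range check on a 64-bit offset is sketched below. The function names are hypothetical and the exact expression in vnode_pager_addr() is not reproduced here; the sketch only assumes a check of the form "cast to 32-bit signed, then test for negative".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical pre-fix check: the cast keeps only the low 32 bits
 * (implementation-defined in C; common compilers truncate), so any
 * offset with bit 31 set looks negative.
 */
static bool
offset_invalid_buggy(int64_t address)
{
	return ((int32_t)address < 0);
}

/* Hypothetical post-fix check: compare the full 64-bit value. */
static bool
offset_invalid_fixed(int64_t address)
{
	return (address < 0);
}
```

With the cast in place, an offset of exactly 2GB is misclassified as out of range even though it is a valid position in a large file.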


138406 05-Dec-2004 alc

Correct a sanity check in vnode_pager_generic_putpages(). The cast used
to implement the sanity check should have been changed when we converted
the implementation of vm_pindex_t from 32 to 64 bits. (Thus, RELENG_4 is
not affected.) The consequence of this error would be a legitimate write to
an extremely large file being treated as an errant attempt to write
metadata.

Discussed with: tegge@


138129 27-Nov-2004 das

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


138114 26-Nov-2004 cognet

Remove useless casts.


138066 24-Nov-2004 delphij

Try to close a potential, but serious race in our VM subsystem.

Historically, our contigmalloc1() and contigmalloc2() assume
that a page in PQ_CACHE can be unconditionally reused by busying
and freeing it. Unfortunately, when object happens to be not
NULL, the code will set m->object to NULL and disregard the fact
that the page is actually in the VM page bucket, resulting in
page bucket hash table corruption and finally, a filesystem
corruption, or a 'page not in hash' panic.

This commit borrows the idea from DragonFlyBSD's VM fix
by Matthew Dillon[1]. This version of the patch will
do the following checks:

- When scanning pages in PQ_CACHE, check hold_count and
skip over pages that are held temporarily.
- For pages in PQ_CACHE and selected as candidate of being
freed, check if it is busy at that time.

Note: It seems that this might be unrelated to kern/72539.

Obtained from: DragonFlyBSD, sys/vm/vm_contig.c,v 1.11 and 1.12 [1]
Reminded by: Matt Dillon
Reworked by: alc
MFC After: 1 week


137910 20-Nov-2004 das

Disable U area swapping and remove the routines that create, destroy,
copy, and swap U areas.

Reviewed by: arch@


137726 15-Nov-2004 phk

Make VOP_BMAP return a struct bufobj for the underlying storage device
instead of a vnode for it.

The vnode_pager does not and should not have any interest in what
the filesystem uses for backend.

(vfs_cluster doesn't use the backing store argument.)


137725 15-Nov-2004 phk

Add pbgetbo()/pbrelbo() lighter weight versions of pbgetvp()/pbrelvp().


137723 15-Nov-2004 phk

More kasserts.


137722 15-Nov-2004 phk

style polishing.


137721 15-Nov-2004 phk

Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.


137720 15-Nov-2004 phk

expect the caller to have called pbrelvp() if necessary.


137719 15-Nov-2004 phk

Explicitly call pbrelvp()


137457 09-Nov-2004 phk

Improve readability with a bunch of typedefs for the pager ops.

These can also be used for prototypes in the pagers.


137393 08-Nov-2004 des

#include <vm/vm_param.h> instead of <machine/vmparam.h> (the former
includes the latter, but also declares variables which are defined
in kern/subr_param.c).

Change some VM parameters from quad_t to unsigned long. They refer to
quantities (size limits for text, heap and stack segments) which must
necessarily be smaller than the size of the address space, so long is
adequate on all platforms.

MFC after: 1 week


137324 06-Nov-2004 alc

Eliminate an unnecessary atomic operation. Articulate the rationale in
a comment.


137309 06-Nov-2004 rwatson

Abstract the logic to look up the uma_bucket_zone given a desired
number of entries into bucket_zone_lookup(), which helps make more
clear the logic of consumers of bucket zones.

Annotate the behavior of bucket_init() with a comment indicating
how the various data structures, including the bucket lookup tables,
are initialized.


137306 06-Nov-2004 phk

Remove dangling variable


137305 06-Nov-2004 rwatson

Annotate what bucket_size[] array does; staticize since it's used only
in uma_core.c.


137299 06-Nov-2004 das

Fix the last known race in swapoff(), which could lead to a spurious panic:

swapoff: failed to locate %d swap blocks

The race occurred because putpages() can block between the time it
allocates swap space and the time it updates the swap metadata to
associate that space with a vm_object, so swapoff() would complain
about the temporary inconsistency. I hoped to fix this by making
swp_pager_getswapspace() and swp_pager_meta_build() a single atomic
operation, but that proved to be inconvenient. With this change,
swapoff() simply doesn't attempt to be so clever about detecting when
all the pageout activity to the target device should have drained.


137297 06-Nov-2004 alc

Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc()
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.

Reviewed by: tegge@


137268 05-Nov-2004 jhb

- Set the priority of the page zeroing thread using sched_prio() when the
thread is created rather than adjusting the priority in the main
function. (kthread_create() should probably take the initial priority
as an argument.)
- Only yield the CPU in the !PREEMPTION case if there are any other
runnable threads. Yielding when there isn't anything else better to do
just wastes time in pointless context switches (albeit while the system
is idle.)


137243 05-Nov-2004 alc

During traversal of the inactive queue, try locking the page's containing
object before accessing the page's flags or the object's reference count.


137242 05-Nov-2004 alc

Eliminate another unnecessary call to vm_page_busy() that immediately
precedes a call to vm_page_rename(). (See the previous revision for a
detailed explanation.)


137239 05-Nov-2004 das

Close a race in swapoff(). Here are the gory details:

In order to avoid livelock, swapoff() skips over objects with a
nonzero pip count and makes another pass if necessary. Since it is
impossible to know which objects we care about, it would choose an
arbitrary object with a nonzero pip count and wait for it before
making another pass, the theory being that this object would finish
paging about as quickly as the ones we care about. Unfortunately,
we may have slept since we acquired a reference to this object.
Hack around this problem by tsleep()ing on the pointer anyway, but
timeout after a fixed interval. More elegant solutions are possible,
but the ones I considered unnecessarily complicate this rare case.

Also, kill some nits that seem to have crept into the swapoff() code
in the last 75 revisions or so:

- Don't pass both sp and sp->sw_used to swap_pager_swapoff(), since
the latter can be derived from the former.

- Replace swp_pager_find_dev() with something simpler. There's no
need to iterate over the entire list of swap devices just to determine
if a given block is assigned to the one we're interested in.

- Expand the scope of the swhash_mtx in a couple of places so that it
isn't released and reacquired once for every hash bucket.

- Don't drop the swhash_mtx while holding a reference to an object.
We need to lock the object first. Unfortunately, doing so would
violate the established lock order, so use VM_OBJECT_TRYLOCK() and
try again on a subsequent pass if the object is already locked.

- Refactor swp_pager_force_pagein() and swap_pager_swapoff() a bit.


137197 04-Nov-2004 phk

Retire b_magic now, we have the bufobj containing the same hint.


137191 04-Nov-2004 phk

De-couple our I/O bio request from the embedded bio in buf by explicitly
copying the fields.


137186 04-Nov-2004 phk

Remove buf->b_dev field.


137168 03-Nov-2004 alc

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


137104 31-Oct-2004 alc

Introduce a Boolean variable wakeup_needed to avoid repeated, unnecessary
calls to wakeup() by vm_page_zero_idle_wakeup().


137091 30-Oct-2004 alc

During traversal of the active queue by vm_pageout_page_stats(), try
locking the page's containing object before accessing the page's flags.


137079 30-Oct-2004 alc

Eliminate an unused but initialized variable.


137060 30-Oct-2004 alc

Add an assignment statement that I omitted from the previous revision.


137005 28-Oct-2004 alc

Assert that the containing vm object is locked in vm_page_cache() and
vm_page_try_to_cache().


137001 27-Oct-2004 bmilekic

Fix an INVARIANTS-only bug introduced in Revision 1.104:

If INVARIANTS is defined, and in the rare case that we have
allocated some objects from the slab and at least one initializer
on at least one of those objects failed, and we need to fail the
allocation and push the uninitialized items back into the slab
caches -- in that scenario, we would fail to [re]set the
bucket cache's ub_bucket item references to NULL, which would
eventually trigger a KASSERT.


136996 27-Oct-2004 alc

During traversal of the active queue, try locking the page's containing
object before accessing the page's flags or the object's reference count.
If the trylock fails, handle the page as though it is busy.
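
The trylock-or-skip pattern described above can be sketched in portable C11 as follows. All names here (struct obj, struct page, scan_page()) are hypothetical stand-ins, and an atomic_flag is used in place of the kernel's object mutex; the only point is that a failed trylock is handled exactly like a busy page, so the scan never blocks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a vm object and a vm page. */
struct obj {
	atomic_flag	locked;
	int		ref_count;
};

struct page {
	struct obj	*object;
	bool		busy;
};

/* Non-blocking lock attempt: true if the lock was acquired. */
static bool
obj_trylock(struct obj *o)
{
	return (!atomic_flag_test_and_set(&o->locked));
}

static void
obj_unlock(struct obj *o)
{
	atomic_flag_clear(&o->locked);
}

/* Returns true if the page was examined, false if it was skipped. */
static bool
scan_page(struct page *m)
{
	if (!obj_trylock(m->object))
		return (false);	/* cannot inspect safely: treat as busy */
	if (m->busy) {
		obj_unlock(m->object);
		return (false);
	}
	/* Safe to read the page's flags and m->object->ref_count here. */
	obj_unlock(m->object);
	return (true);
}
```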


136977 26-Oct-2004 phk

Also check that the sectormask is bigger than zero.

Wrap this overly long KASSERT and remove newline.


136966 26-Oct-2004 phk

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot
when filesystems sit on GEOM, so don't bother for now.


136961 26-Oct-2004 phk

Don't clear flags we just checked were not set.


136952 25-Oct-2004 alc

Assert that the containing vm object is locked in vm_page_flash().


136931 24-Oct-2004 alc

Assert that the containing vm object is locked in vm_page_busy() and
vm_page_wakeup().


136927 24-Oct-2004 phk

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which call through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


136924 24-Oct-2004 alc

Acquire the vm object lock before rather than after calling
vm_page_sleep_if_busy(). (The motivation being to transition
synchronization of the vm_page's PG_BUSY flag from the global page queues
lock to the per-object lock.)


136923 24-Oct-2004 alc

Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().


136850 24-Oct-2004 alc

Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab()
that indicates that the caller does not want a page with its busy flag set.
In many places, the global page queues lock is acquired and released just
to clear the busy flag on a just allocated page. Both the allocation of
the page and the clearing of the busy flag occur while the containing vm
object is locked. So, the busy flag might as well never be set.


136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


136751 21-Oct-2004 phk

Move the VI_BWAIT flag into a bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use of interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


136655 18-Oct-2004 alc

Correct two errors in PG_BUSY management by vm_page_cowfault(). Both
errors are in rarely executed paths.
1. Each time the retry_alloc path is taken, the PG_BUSY must be set again.
Otherwise vm_page_remove() panics.
2. There is no need to set PG_BUSY on the newly allocated page before
freeing it. The page already has PG_BUSY set by vm_page_alloc().
Setting it again could cause an assertion failure.

MFC after: 2 weeks


136627 17-Oct-2004 alc

Assert that the containing object is locked in vm_page_io_start() and
vm_page_io_finish(). The motivation being to transition synchronization of
the vm_page's busy field from the global page queues lock to the per-object
lock.


136621 17-Oct-2004 alc

Remove unnecessary check for curthread == NULL.


136404 11-Oct-2004 peter

Put on my peril sensitive sunglasses and add a flags field to the internal
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.

I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.

With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.

This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.


136334 09-Oct-2004 green

In the previous revision, I did not intend to change the default value
of "nosleepwithlocks."

Submitted by: ru


136276 08-Oct-2004 green

Fix critical stability problems that can cause UMA mbuf cluster
state management corruption, mbuf leaks, general mbuf corruption,
and at least on i386 a first level splash damage radius that
encompasses up to about half a megabyte of the memory after
an mbuf cluster's allocation slab. In short, this has caused
instability nightmares anywhere the right kind of network traffic
is present.

When the polymorphic refcount slabs were added to UMA, the new types
were not used pervasively. In particular, the slab management
structure was turned into one for refcounts, and one for non-refcounts
(supposed to be mostly like the old slab management structure),
but the latter was almost always used throughout. In general, every
access to zones with UMA_ZONE_REFCNT turned on corrupted the
"next free" slab offset and the refcount with each other and
with other allocations (on i386, 2 mbuf clusters per 4096 byte slab).

Fix things so that the right type is used to access refcounted zones
where it was not before. There are additional errors in gross
overestimation of padding, it seems, that would cause large kegs
(nee zones) to be allocated when small ones would do. Unless I have
analyzed this incorrectly, it is not directly harmful.


135746 24-Sep-2004 das

Don't look for swap blocks in objects that aren't swap-backed.
I expect that this will fix the following panic, reported by Jun:
swap_pager_isswapped: failed to locate all swap meta blocks

MT5 candidate


135727 24-Sep-2004 phk

XXX mark two places where we do not hold a threadcount on the dev when
frobbing the cdevsw.

In both cases we examine only the cdevsw and it is a good question if we
weren't better off copying those properties into the cdev in the first
place. This question will be revisited.


135707 24-Sep-2004 phk

Use dev_re[fl]thread() to maintain a ref on the device driver while
we call the ->d_mmap function.


135470 19-Sep-2004 das

The zone from which proc structures are allocated is marked
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.


135262 15-Sep-2004 phk

Add a new function isa_dma_init() which returns an errno when it fails
and which takes a M_WAITOK/M_NOWAIT flag argument.

Add compatibility isa_dmainit() macro which whines loudly if
isa_dma_init() fails.

Problem uncovered by: tegge


135088 11-Sep-2004 alc

System maps are prohibited from mapping vnode-backed objects. Take
advantage of this restriction to avoid acquiring and releasing Giant when
wiring pages within a system map.

In collaboration with: tegge@


134892 07-Sep-2004 phk

add KASSERTS


134747 04-Sep-2004 alc

Enable debug.mpsafevm by default on amd64 and i386. This enables copy-on-
write and zero-fill faults to run without holding Giant. It is still
possible to disable Giant-free operation by setting debug.mpsafevm to 0 in
loader.conf.


134675 03-Sep-2004 alc

Push Giant deep into vm_forkproc(), acquiring it only if the process has
mapped System V shared memory segments (see shmfork_myhook()) or requires
the allocation of an ldt (see vm_fault_wire()).


134649 02-Sep-2004 scottl

Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c). This
is a possible MT5 candidate.


134615 01-Sep-2004 alc

Remove dead code.


134612 01-Sep-2004 alc

In vm_fault_unwire() eliminate the acquisition and release of Giant in the
case of non-kernel pmaps.


134586 01-Sep-2004 julian

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


134496 29-Aug-2004 alc

Move the acquisition and release of the lock on the object at the head of
the shadow chain outside of the loop in vm_object_madvise(), reducing the
number of times that this lock is acquired and released.


134461 29-Aug-2004 iedowse

Prevent vm_page_zero_idle_wakeup() from attempting to wake up the
page zeroing thread before it has been created. It was possible for
calls to free() very early in the boot process to panic here because
the sleep queues were not yet initialised. Specifically, sysinit_add()
running at SI_SUB_KLD would trigger this if the array of pointers
became big enough to require uma_large_alloc() allocations.

Submitted by: peter


134184 22-Aug-2004 marcel

Move the cow field between wire_count and hold_count. This is the
position that is 64-bit aligned and makes sure that the valid and
dirty fields are also 64-bit aligned. This means that if PAGE_SIZE
is 32K, the size of the vm_page structure is only increased by 8
bytes instead of 16 bytes. More importantly, the vm_page structure
is either 120 or 128 bytes on ia64. These are "interesting" sizes.


134139 22-Aug-2004 alc

In the previous revision, I failed to condition an early release of Giant
in vm_fault() on debug_mpsafevm. If debug_mpsafevm was not set, the result
was an assertion failure early in the boot process.

Reported by: green@


134128 21-Aug-2004 alc

Further reduce the use of Giant by vm_fault(): Giant is held only when
manipulating a vnode, e.g., calling vput(). This reduces contention for
Giant during many copy-on-write faults, resulting in some additional
speedup on SMPs.

Note: debug_mpsafevm must be enabled for this optimization to take effect.


133996 19-Aug-2004 alc

Acquire and release Giant around a call to VOP_BMAP(). (This is a
prerequisite to any further reduction in Giant's use by vm_fault().)


133807 16-Aug-2004 alc

- Introduce and use a new tunable "debug.mpsafevm". At present, setting
"debug.mpsafevm" results in (almost) Giant-free execution of zero-fill
page faults. (Giant is held only briefly, just long enough to determine
if there is a vnode backing the faulting address.)

Also, condition the acquisition and release of Giant around calls to
pmap_remove() on "debug.mpsafevm".

The effect on performance is significant. On my dual Opteron, I see a
3.6% reduction in "buildworld" time.

- Use atomic operations to update several counters in vm_fault().


133796 16-Aug-2004 green

Rather than bringing back all of the changes to make VM map deletion
wait for system wires to disappear, do so (much more trivially) by
instead only checking for system wires of user maps and not kernel maps.

Alternative by: tor
Reviewed by: alc


133726 14-Aug-2004 alc

Remove spl calls.


133636 13-Aug-2004 alc

Replace the linear search in vm_map_findspace() with an O(log n)
algorithm built into the map entry splay tree. This replaces the
first_free hint in struct vm_map with two fields in vm_map_entry:
adj_free, the amount of free space following a map entry, and
max_free, the maximum amount of free space in the entry's subtree.
These fields make it possible to find a first-fit free region of a
given size in one pass down the tree, so O(log n) amortized using
splay trees.

This significantly reduces the overhead in vm_map_findspace() for
applications that mmap() many hundreds or thousands of regions, and
has a negligible slowdown (0.1%) on buildworld. See, for example, the
discussion of a micro-benchmark titled "Some mmap observations
compared to Linux 2.6/OpenBSD" on -hackers in late October 2003.

OpenBSD adopted this approach in March 2002, and NetBSD added it in
November 2003, both with Red-Black trees.

Submitted by: Mark W. Krentel
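
The role of adj_free and max_free in a one-pass first-fit search can be sketched as below. This is a simplified illustration under stated assumptions: struct entry, update_max_free() and first_fit() are hypothetical, the tree is a plain address-ordered binary tree with no splaying, and only the two field names come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical map entry in an address-ordered binary tree. */
struct entry {
	uint64_t	start, end;	/* mapped range [start, end) */
	uint64_t	adj_free;	/* free space right after this entry */
	uint64_t	max_free;	/* largest adj_free in this subtree */
	struct entry	*left, *right;
};

/* Recompute max_free from adj_free and the children (children first). */
static void
update_max_free(struct entry *e)
{
	uint64_t m = e->adj_free;

	if (e->left != NULL && e->left->max_free > m)
		m = e->left->max_free;
	if (e->right != NULL && e->right->max_free > m)
		m = e->right->max_free;
	e->max_free = m;
}

/*
 * Return the lowest-address entry whose trailing gap holds len bytes,
 * or NULL.  Each step descends one level, so the cost is bounded by
 * the height of the tree.
 */
static const struct entry *
first_fit(const struct entry *e, uint64_t len)
{
	while (e != NULL && e->max_free >= len) {
		if (e->left != NULL && e->left->max_free >= len)
			e = e->left;	/* a lower-address gap exists */
		else if (e->adj_free >= len)
			return (e);	/* the gap starts at e->end */
		else
			e = e->right;	/* it must be in the right subtree */
	}
	return (NULL);
}
```

Checking the left subtree before the entry's own gap is what makes the result first-fit by address rather than just any fit.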


133598 12-Aug-2004 tegge

The vm map lock is needed in vm_fault() after the page has been found,
to avoid later changes before pmap_enter() and vm_fault_prefault()
have completed.

Simplify deadlock avoidance by not blocking on vm map relookup.

In collaboration with: alc


133587 12-Aug-2004 green

Re-delete the comment from r1.352.


133435 10-Aug-2004 green

Back out all behavioral changes.


133401 09-Aug-2004 green

Revamp VM map wiring.

* Allow no-fault wiring/unwiring to succeed for consistency;
however, the wired count remains at zero, so it's a special case.

* Fix issues inside vm_map_wire() and vm_map_unwire() where the
exact state of user wiring (one or zero) and system wiring
(zero or more) could be confused; for example, system unwiring
could succeed in removing a user wire, instead of being an
error.

* Require all mappings to be unwired before they are deleted.
When VM space is still wired upon deletion, it will be waited
upon for the following unwire. This makes vslock(9) work
rather than allowing kernel-locked memory to be deleted
out from underneath of its consumer as it would before.


133398 09-Aug-2004 alc

Make two changes to vm_fault().
1. Move a comment to its proper place, updating it. (Except for white-
space, this comment had been unchanged since revision 1.1!)
2. Remove spl calls.


133395 09-Aug-2004 alc

Remove a stale comment from vm_map_lookup() that pertains to share maps.
(The last vestiges of the share map code were removed in revisions 1.153
and 1.159.)


133355 09-Aug-2004 alc

Make two changes to vm_fault().
1. Retain the map lock until after the calls to pmap_enter() and
vm_fault_prefault().
2. Remove a stale comment. Submitted by: tegge@


133318 08-Aug-2004 phk

Tag all geom classes in the tree with a version number.


133253 07-Aug-2004 alc

Remove dead code. A vm_map's first_free is never NULL (even if the map is
full).

(This is preparation for an O(log n) implementation of vm_map_findspace().)

Submitted by: Mark W. Krentel


133230 06-Aug-2004 rwatson

Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg().
This doesn't trace every event of interest in UMA, but provides
enough basic information to explain lock traces and sleep patterns.


133185 05-Aug-2004 green

Turn on the new contigmalloc(9) by default. There should not actually
be a reason to use the old contigmalloc(9), but if desired, the
vm.old_contigmalloc setting can be tuned/sysctld back to 0 for now.


133158 05-Aug-2004 phk

Remove a product specific workaround for wrong modes when mmap(2)'ing
devices. They have had plenty of time to adjust now.


133143 04-Aug-2004 alc

- Push down the acquisition and release of Giant into pmap_enter_quick()
on those architectures without pmap locking.
- Eliminate the acquisition and release of Giant in vm_map_pmap_enter().


133113 04-Aug-2004 dfr

In dev_pager_updatefake, m->valid is typically 0 on entry. It
should be set to VM_PAGE_BITS_ALL before returning, to ensure that
neither vm_pager_get_pages nor vm_fault calls vm_page_zero_invalid
after dev_pager_getpages has returned.

Submitted by: tegge


132999 02-Aug-2004 alc

Eliminate the acquisition and release of Giant around the call to
pmap_mincore() in mincore(2). Either pmap locking exists (alpha, amd64,
i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).


132987 02-Aug-2004 green

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


132899 30-Jul-2004 alc

- Push down the acquisition and release of Giant into pmap_protect() on
those architectures without pmap locking.
- Eliminate the acquisition and release of Giant from vm_map_protect().

(Translation: mprotect(2) runs to completion without touching Giant on
alpha, amd64, i386 and ia64.)


132898 30-Jul-2004 alc

Giant is no longer required by vm_waitproc() and vmspace_exitfree().
Eliminate its acquisition and release around vm_waitproc() in kern_wait().


132884 30-Jul-2004 dfr

Fix a memory leak in the device pager which is exposed by the NVIDIA
OpenGL driver.

Submitted by: nvidia (possibly also tegge)


132883 30-Jul-2004 dfr

Fix handling of msync(2) for character special files.

Submitted by: nvidia


132880 30-Jul-2004 mux

Get rid of another lockmgr(9) consumer by using sx locks for the user
maps. We always acquire the sx lock exclusively here, but we can't
use a mutex because we want to be able to sleep while holding the
lock. This is completely equivalent to what we were doing with the
lockmgr(9) locks before.

Approved by: alc


132852 29-Jul-2004 alc

Advance the state of pmap locking on alpha, amd64, and i386.

- Enable recursion on the page queues lock. This allows calls to
vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page
queues lock held. Such calls are made to allocate page table pages
and pv entries.
- The previous change enables a partial reversion of vm/vm_page.c
revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault()
now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT.
- Add partial locking to pmap_copy(). (As a side-effect, pmap_copy()
should now be faster on i386 SMP because it no longer generates IPIs
for TLB shootdown on the other processors.)
- Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now,
all changes to a user-level pmap on alpha, amd64, and i386 are performed
with appropriate locking.)


132842 29-Jul-2004 bmilekic

Rework the way slab header storage space is calculated in UMA.

- zone_large_init() stays pretty much the same.
- zone_small_init() will try to stash the slab header in the slab page
being allocated if the amount of calculated wasted space is less
than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular
case). If the amount of wasted space is >= UMA_MAX_WASTE, then
UMA_ZONE_OFFPAGE will be set and the slab header will be allocated
separately for better use of space.
- uma_startup() calculates the maximum ipers required in offpage slabs
(so that the offpage slab header zone(s) can be sized accordingly).
The algorithm used to calculate this replaces the old calculation
(which only happened to work coincidentally). We now iterate over
possible object sizes, starting from the smallest one, until we
determine that wastedspace calculated in zone_small_init() might
end up being greater than UMA_MAX_WASTE, at which point we use the
found object size to compute the maximum possible ipers. The
reason this works is because:
- wastedspace versus objectsize is a see-saw function with
local minima all equal to zero and local maxima growing
directly proportional to objectsize. This implies that
for objects up to or equal to a certain objectsize, the see-saw
remains entirely below UMA_MAX_WASTE, so for those objectsizes
it is impossible to ever go OFFPAGE for slab headers.
- ipers (items-per-slab) versus objectsize is an inversely
proportional function which falls off very quickly (very large
for small objectsizes).
- To determine the maximum ipers we'll ever need from OFFPAGE
slab headers we first find the largest objectsize for which
we are guaranteed to not go offpage for and use it to compute
ipers (as though we were offpage). Since the only objectsizes
allowed to go offpage are bigger than the found objectsize,
and since ipers vs objectsize is inversely proportional (and
monotonically decreasing), then we are guaranteed that the
ipers computed is always >= what we will ever need in offpage
slab headers.
- Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly
padded) size of each freelist index so that offset calculations are
fixed.

This might fix weird data corruption problems and certainly allows
ARM to now boot to at least single-user (via simulator).

Tested on i386 UP by me.
Tested on sparc64 SMP by fenner.
Tested on ARM simulator to single-user by cognet.
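
The see-saw argument above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the SK_ constants, the 64-byte header, and the helper names are hypothetical, and the per-item freelist index that the real calculation accounts for is ignored.

```c
#include <assert.h>

/* Assumed values for illustration only; the kernel's constants may differ. */
#define	SK_SLAB_SIZE	4096u			/* bytes in one slab */
#define	SK_HDR_SIZE	64u			/* in-page slab header, hypothetical */
#define	SK_MAX_WASTE	(SK_SLAB_SIZE / 10)	/* stand-in for UMA_MAX_WASTE */

/* Items that fit in "usable" bytes; falls as objsize grows. */
static unsigned int
items_per_slab(unsigned int usable, unsigned int objsize)
{
	assert(objsize > 0 && objsize <= usable);
	return (usable / objsize);
}

/* Bytes left over after packing; always less than objsize (the see-saw). */
static unsigned int
wasted_space(unsigned int usable, unsigned int objsize)
{
	return (usable - items_per_slab(usable, objsize) * objsize);
}

/* Smallest object size whose waste reaches max_waste, or 0 if none does. */
static unsigned int
first_offpage_size(unsigned int usable, unsigned int max_waste)
{
	unsigned int size;

	for (size = 1; size <= usable; size++)
		if (wasted_space(usable, size) >= max_waste)
			return (size);
	return (0);
}

/* Upper bound on items per slab for any size that may go offpage. */
static unsigned int
max_offpage_ipers(void)
{
	unsigned int size;

	size = first_offpage_size(SK_SLAB_SIZE - SK_HDR_SIZE, SK_MAX_WASTE);
	assert(size != 0);
	/* Offpage headers leave the whole slab available for items. */
	return (SK_SLAB_SIZE / size);
}
```

Because items-per-slab never increases as the object size grows, the count computed for the first size that can go offpage bounds the count for every larger size, which is the property the commit message relies on.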


132804 28-Jul-2004 alc

Correct a very old error in both vm_object_madvise() (originating in
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset. Consequently, both may
operate on the wrong pages.

Quoting Matt, "This could be responsible for all of the sporatic madvise
oddness that has been reported over the years."

Reviewed by: Matt Dillon


132684 27-Jul-2004 alc

- Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.


132638 25-Jul-2004 alc

For years, kmem_alloc_pageable() has been misused. Now that the last of
these misuses has been corrected, remove it before new ones appear, such as
arm/arm/pmap.c revision 1.8.


132636 25-Jul-2004 alc

Remove spl calls.


132627 25-Jul-2004 alc

Make the code and comments for vm_object_coalesce() consistent.


132593 24-Jul-2004 alc

Simplify vmspace initialization. The bcopy() of fields from the old
vmspace to the new vmspace in vmspace_exec() is mostly wasted effort. With
one exception, vm_swrss, the copied fields are immediately overwritten.
Instead, initialize these fields to zero in vmspace_alloc(), eliminating a
bcopy() from vmspace_exec() and a bzero() from vmspace_fork().


132550 22-Jul-2004 alc

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)


132517 21-Jul-2004 green

Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking
being incomplete, it currently has to know how to drop and pick back
up the vm_object's mutex if it has to sleep and drop the page queue
mutex. The problem with this is that if the page is busy, while we
are sleeping, the page can be freed and object disappear. When trying
to lock m->object, we'd get a stale or NULL pointer and crash.

The object is now cached, but this makes the assumption that
the object is referenced in some manner and will not itself
disappear while it is unlocked. Since this only happens if
the object is locked, I had to remove an assumption earlier in
contigmalloc() that reversed the order of locking the object and
doing vm_page_sleep_if_busy(), not the normal order.


132483 21-Jul-2004 peter

Semi-gratuitous change. Move two refcount operations to their own lines
rather than be buried inside an if (expression). And now that the if
expression is the same in both exit paths, use the same ordering.


132475 21-Jul-2004 peter

Move the initialization and teardown of pmaps to the vmspace zone's
init and fini handlers. Our vm system removes all userland mappings at
exit prior to calling pmap_release. It just so happens that we might
as well reuse the pmap for the next process since the userland slate
has already been wiped clean.

However, there is a functional benefit to this as well. For platforms
that share userland and kernel context in the same pmap, it means that
the kernel portion of a pmap remains valid after the vmspace has been
freed (process exit) and while it is in uma's cache. This is significant
for i386 SMP systems with kernel context borrowing because it avoids
a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case.

Tested on: amd64, i386, sparc64, alpha
Glanced at by: alc


132420 19-Jul-2004 green

Remove extraneous locks on the VM free page queue mutex; it is not
meant to be recursed upon, and could cause a deadlock inside the
new contigmalloc (vm.old_contigmalloc=0) code.

Submitted by: alc


132414 19-Jul-2004 alc

- Eliminate the pte object from the pmap. Instead, page table pages are
allocated as "no object" pages. Similar changes were made to the amd64
and i386 pmap last year. The primary reason being that maintaining
a pte object leads to lock order violations. A secondary reason being
that the pte object is redundant, i.e., the page table itself can be
used to lookup page table pages. (Historical note: The pte object
predates our ability to allocate "no object" pages. Thus, the pte
object was a necessary evil.)
- Unconditionally check the vm object lock's status in vm_page_remove().
Previously, this assertion could not be made on Alpha due to its use
of a pte object.


132407 19-Jul-2004 green

Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in
GENERIC/for WITNESS users, make sure the sysctl to disable the behavior
is read-only and always enabled.


132379 19-Jul-2004 green

Reimplement contigmalloc(9) with an algorithm which stands a greatly-
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.

The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less. The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block. If this is not enough, the subsequent passes
page out any unwired memory. To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
system's free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.

The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed. Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.

There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used. Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month. It is safe to switch at
run-time to see the difference it makes.

A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig(). These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.

When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats. Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well. This invalidates the BUGS section of the
contigmalloc(9) manpage.


132336 18-Jul-2004 alc

Remove the GIANT_REQUIRED preceding pmap_remove() in
vm_pageout_map_deactivate_pages().


132220 15-Jul-2004 alc

Push down the acquisition and release of the page queues lock into
pmap_protect() and pmap_remove(). In general, they require the lock in
order to modify a page's pv list or flags. In some cases, however,
pmap_protect() can avoid acquiring the lock.


132040 12-Jul-2004 alc

Remove an unused and unimplemented sysctl. (For the record, it was marked
as unimplemented in revision 1.129 nearly six years ago.)


131937 10-Jul-2004 alc

Increase the scope of the page queues lock in vm_page_alloc() to cover
a diagnostic check that accesses the cache queue count.


131719 06-Jul-2004 alc

Micro-optimize vmspace for 64-bit architectures: Colocate vm_refcnt and
vm_exitingcnt so that alignment does not result in wasted space.


131665 06-Jul-2004 bms

Properly brucify a string by outdenting it.


131573 04-Jul-2004 bmilekic

Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1
and WITNESS is not built, then force all M_WAITOK allocations to
M_NOWAIT behavior (transparently). This is to be used temporarily
if weird deadlocks are reported because we still have code paths
that perform M_WAITOK allocations with lock(s) held, which can
lead to deadlock. If WITNESS is compiled, then the sysctl is ignored
and we ask witness to tell us whether we have locks held, converting
to M_NOWAIT behavior only if it tells us that we do.
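
The flag conversion described in the paragraph above can be sketched as a pure function. This is an illustration under stated assumptions: SK_NOWAIT and SK_WAITOK are stand-ins for the kernel's M_NOWAIT and M_WAITOK, and effective_wait_flags() with its boolean parameters is a hypothetical helper, not a function in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel defines its own M_NOWAIT/M_WAITOK. */
#define	SK_NOWAIT	0x1
#define	SK_WAITOK	0x2

/*
 * Return the flags an allocation would effectively run with.
 *   witness_built     kernel compiled with WITNESS
 *   nosleepwithlocks  value of the debug.nosleepwithlocks sysctl
 *   locks_held        what WITNESS reports about the calling thread
 */
static int
effective_wait_flags(int flags, bool witness_built, bool nosleepwithlocks,
    bool locks_held)
{
	bool downgrade;

	/* A request is either allowed to sleep or it is not, never both. */
	assert((flags & SK_WAITOK) == 0 || (flags & SK_NOWAIT) == 0);

	if ((flags & SK_WAITOK) == 0)
		return (flags);		/* nothing to convert */

	/* With WITNESS the sysctl is ignored and lock state decides. */
	downgrade = witness_built ? locks_held : nosleepwithlocks;
	if (downgrade)
		flags = (flags & ~SK_WAITOK) | SK_NOWAIT;
	return (flags);
}
```

The sketch makes the asymmetry visible: without WITNESS the conversion is all-or-nothing, while with WITNESS it applies only to callers that actually hold locks.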

Note this removes the previous mbuf.h inclusion as well (only needed
by last revision), and cleans up unneeded [artificial] comparisons
to just the mbuf zones. The problem described above has nothing to
do with previous mbuf wait behavior; it is a general problem.


131572 04-Jul-2004 green

Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related
zones, and do it by direct comparison of uma_zone_t instead of strcmp.

The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but
this is mostly no longer the case. M_WAITOK has taken over the spot
M_TRYWAIT used to have, and for mbuf things, still may return NULL if
the code path is incorrectly holding a mutex going into mbuf allocation
functions.

The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock
the system to try to malloc or uma_zalloc something with a mutex held
and M_WAITOK specified, it is absolutely required to not return NULL
and will result in instability and/or security breaches otherwise.
There is still room to add the WITNESS_WARN() to all cases so that
we are notified of the possibility of deadlocks, but it cannot change
the value of the "badness" variable and allow allocation to actually
fail except for the specialized cases which used to be M_TRYWAIT.


131528 03-Jul-2004 green

Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject
to failing -- that is, allocations via malloc(M_WAITOK) that are required
to never fail -- if WITNESS is not defined. While everyone should be
running WITNESS, in any case, zone "Mbuf" allocations are really the only
ones that should be screwed with by this hack.

This hack is crashing people, and would continue to do so with or without
WITNESS. Things shouldn't be allocating with M_WAITOK with locks held,
but it's not okay just to always remove M_WAITOK when !WITNESS.

Reported by: Bernd Walter <ticso@cicely5.cicely.de>


131481 02-Jul-2004 jhb

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)


131473 02-Jul-2004 jhb

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to switch to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


131434 02-Jul-2004 jhb

- Don't use a variable to point to the user area that we only use once.
Just use p2->p_uarea directly instead.
- Remove an old and mostly bogus assertion regarding p2->p_sigacts.
- Use RANGEOF macro ala fork1() to clean up bzero/bcopy of p_stats.


131256 28-Jun-2004 tegge

Initialize result->backing_object_offset before linking result onto the list of
vm objects shadowing source in vm_object_shadow(). This closes a race where
vm_object_collapse() could be called with a partially uninitialized object
argument causing symptoms that looked like hardware problems, e.g. signal 6,
10, 11 or a /bin/sh busy-waiting for a nonexistent child process.


131252 28-Jun-2004 gallatin

Use MIN() macro rather than ulmin() inline, and fix stray tab
that snuck in with my last commit.

Submitted by: green


131251 28-Jun-2004 gallatin

Fix alpha - the use of min() on longs was losing the high bits and
returning wrong answers, leading to strange values vm2->vm_{s,t,d}size.
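
The truncation can be reproduced in userland with a short sketch. It assumes a platform where unsigned int is 32 bits wide; min_uint(), MIN_SKETCH(), clamp_narrow() and clamp_wide() are illustrative names and not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() helper whose parameters are unsigned int. */
static unsigned int
min_uint(unsigned int a, unsigned int b)
{
	return (a < b ? a : b);
}

/* Type-agnostic macro in the spirit of MIN(); compares in the operand type. */
#define	MIN_SKETCH(a, b)	(((a) < (b)) ? (a) : (b))

/* Both 64-bit inputs are narrowed before comparing, so high bits are lost. */
static uint64_t
clamp_narrow(uint64_t want, uint64_t limit)
{
	return (min_uint((unsigned int)want, (unsigned int)limit));
}

/* The comparison happens at full width, so the result is the true minimum. */
static uint64_t
clamp_wide(uint64_t want, uint64_t limit)
{
	uint64_t result = MIN_SKETCH(want, limit);

	assert(result <= want && result <= limit);
	return (result);
}
```

With a 4 GiB + 1 request against an 8 GiB limit, the narrow version compares 1 with 0 and returns 0, while the wide version returns the request unchanged.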


131163 27-Jun-2004 das

Update a stale comment. The heuristic to swap processes out based on
the number of pages already paged out was broken in rev 1.10 and
removed in rev 1.11.


131152 26-Jun-2004 alc

Remove an unused field from the vmspace structure.


131073 24-Jun-2004 green

Correct the tracking of various bits of the process's vmspace and vm_map
when not propagated on fork (due to minherit(2)). Consistency checks
otherwise fail when the vm_map is freed and it appears to have not been
emptied completely, causing an INVARIANTS panic in vm_map_zdtor().

PR: kern/68017
Submitted by: Mark W. Krentel <krentel@dreamscape.com>
Reviewed by: alc


131027 24-Jun-2004 alc

Call vm_pageout_page_stats() with the page queues lock held.


131023 24-Jun-2004 alc

Remove spl calls.


130995 23-Jun-2004 bmilekic

Make uma_mtx MTX_RECURSE. Here's why:

The general UMA lock is a recursion-allowed lock because
there is a code path where, while we're still configured
to use startup_alloc() for backend page allocations, we
may end up in uma_reclaim() which calls zone_foreach(zone_drain),
which grabs uma_mtx, only to later call into startup_alloc()
because while freeing we needed to allocate a bucket. Since
startup_alloc() also takes uma_mtx, we need to be able to
recurse on it.

This exact explanation is also added as a comment above mtx_init().

Trace showing recursion reported by: Peter Holm <peter-at-holm.cc>


130979 23-Jun-2004 bms

In swap_pager_getpages(), bp->b_dev can be NULL, particularly for the
case of NFS mounted swap, so do not try to dereference it.

While we're here, brucify the printf() call which happens when we
time out on acquisition of vm_page_queue_mtx.

PR: kern/67898
Submitted by: bde (style)


130710 19-Jun-2004 alc

Remove spl() calls. Update comments to reflect the removal of spl() calls.
Remove '\n' from panic() format strings. Remove some blank lines.


130640 17-Jun-2004 phk

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


130626 17-Jun-2004 alc

Do not preset PG_BUSY on VM_ALLOC_NOOBJ pages. Such pages are not
accessible through an object. Thus, PG_BUSY serves no purpose.


130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


130551 16-Jun-2004 julian

Nice is a property of the process as a whole.
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


130502 15-Jun-2004 green

Make contigmalloc() more reliable:

1. Remove a race whereby contigmalloc() would deadlock against the
running processes in the system if they kept reinstantiating
the memory on the active and inactive page queues that it was
trying to flush out. The process doing the contigmalloc() would
sit in "swwrt" forever and the swap pager would be going at full
force, but never get anywhere. Instead of doing it until the
queues are empty, launder for as many iterations as there are
pages in the queue.
2. Do all laundering to swap synchronously; previously, the vnode
laundering was synchronous and the swap laundering not.
3. Increase the number of launder-or-allocate passes to three, from
two, while failing without bothering to do all the laundering on
the third pass if allocation was not possible. This effectively
gives exactly two chances to launder enough contiguous memory,
helpful with high memory churn where a lot of memory from one pass
to the next (and during a single laundering loop) becomes dirtied
again.

I can now reliably hot-plug hardware requiring a 256KB contigmalloc()
without having the kldload/cbb ithread sit around failing to make
progress, while running a busy X session. Previously, it took killing
X to get contigmalloc() to get further (that is, quiescing the system),
and even then contigmalloc() returned failure.


130344 11-Jun-2004 phk

Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.


130283 09-Jun-2004 bmilekic

Backout previous change, I think Julian has a better solution which
does not require type-stable refcnts here.


130278 09-Jun-2004 bmilekic

Make the slabrefzone, the zone from which we allocated slabs with
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:

    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);

Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.

Spotted by: julian, pjd


130201 07-Jun-2004 netchild

Remove references to L1 in the comments, according to Alan they are
historical leftovers.

Approved by: alc


130137 05-Jun-2004 alc

Update stale comments regarding page coloring.


130049 04-Jun-2004 alc

Move the definitions of SWAPBLK_NONE and SWAPBLK_MASK from vm_page.h to
blist.h, enabling the removal of numerous #includes from subr_blist.c.
(subr_blist.c and swap_pager.c are the only users of these definitions.)


129913 01-Jun-2004 bmilekic

Fix a comment above uma_zsecond_create(), describing its arguments.
It doesn't take 'align' and 'flags' but 'master' instead, which is
a reference to the Master Zone, containing the backing Keg.

Pointed out by: Tim Robbins (tjr)


129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


129883 30-May-2004 alc

Remove a stale comment: PG_DIRTY and PG_FILLED were removed in
revisions 1.17 and 1.12 respectively.


129857 30-May-2004 hmp

Correct typo: vm_page_list_find() has been called vm_pageq_find() for quite
a long time, i.e., since the cleanup of the VM Page-queues code done two
years ago.

Reviewed by: Alan Cox <alc at freebsd.org>,
Matthew Dillon <dillon at backplane.com>


129729 25-May-2004 des

MFS: vm_map.c rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE
semantics but provide a sysctl knob for reverting to old ones.


129728 25-May-2004 des

Back out previous commit; it went to the wrong file.


129725 25-May-2004 des

MFS: rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE semantics but
provide a sysctl knob for reverting to old ones.


129701 25-May-2004 alc

Correct two error cases in vm_map_unwire():

1. Contrary to the Single Unix Specification our implementation of
munlock(2) when performed on an unwired virtual address range has
returned an error. Correct this. Note, however, that the behavior
of "system" unwiring is unchanged, only "user" unwiring is changed.
If "system" unwiring is performed on an unwired virtual address
range, an error is still returned.

2. Performing an errant "system" unwiring on a virtual address range
that was "user" (i.e., mlock(2)) but not "system" wired would
incorrectly undo the "user" wiring instead of returning an error.
Correct this.

Discussed with: green@
Reviewed by: tegge@
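
The two cases above can be sketched with a toy model. This is a minimal
sketch, assuming a simplified entry that carries one "user" flag and a
"system" wire count; toy_entry, toy_unwire(), and the UNWIRE_* codes are
hypothetical names for illustration, not the vm_map_unwire() interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch (not the kernel's KERN_* values). */
enum { UNWIRE_OK = 0, UNWIRE_ERR = 1 };

/* Toy stand-in for the wiring state of one map entry. */
struct toy_entry {
	bool user_wired;	/* wired by an mlock(2)-style "user" request */
	int sys_wired;		/* number of outstanding "system" wirings */
};

/* Unwire one entry; user_unwire selects "user" versus "system" unwiring. */
static int
toy_unwire(struct toy_entry *e, bool user_unwire)
{
	assert(e != NULL);
	if (user_unwire) {
		/* Case 1: a "user" unwire of an unwired range is not an error. */
		e->user_wired = false;
		return (UNWIRE_OK);
	}
	/*
	 * Case 2: a "system" unwire with no "system" wiring is an error,
	 * and any "user" wiring on the entry is left untouched.
	 */
	if (e->sys_wired == 0)
		return (UNWIRE_ERR);
	e->sys_wired--;
	return (UNWIRE_OK);
}
```

In this model an errant "system" unwire on an entry that is only "user"
wired returns UNWIRE_ERR and leaves user_wired set.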


129571 22-May-2004 alc

To date, unwiring a fictitious page has produced a panic. The reason
is that PHYS_TO_VM_PAGE() returns the wrong vm_page for fictitious
pages but unwiring uses PHYS_TO_VM_PAGE(). The resulting panic
reported an unexpected wired count. Rather than attempting to fix
PHYS_TO_VM_PAGE(), this fix takes advantage of the properties of
fictitious pages. Specifically, fictitious pages will never be
completely unwired. Therefore, we can keep a fictitious page's wired
count forever set to one and thereby avoid the use of
PHYS_TO_VM_PAGE() when we know that we're working with a fictitious
page, just not which one.

In collaboration with: green@, tegge@
PR: kern/29915


129145 12-May-2004 alc

Restructure vm_page_select_cache() so that adding assertions is easy.

Some of the conditions that caused vm_page_select_cache() to deactivate a
page were wrong. For example, deactivating an unmanaged or wired page is a
nop. Thus, if vm_page_select_cache() had ever encountered an unmanaged or
wired page, it would have looped forever. Now, we assert that the page is
neither unmanaged nor wired.


129143 12-May-2004 alc

Cache queue pages are not mapped. Thus, the pmap_remove_all() by
vm_pageout_scan()'s loop for freeing cache queue pages is unnecessary.


129110 11-May-2004 tjr

To handle orphaned character device vnodes properly in mmap(), check that
v_mount is non-null before dereferencing it. If it's null, behave as if
MNT_NOEXEC was not set on the mount that originally contained it.


129057 09-May-2004 alc

Cache queue pages are not mapped. Thus, the pmap_remove_all() by
vm_page_alloc() is unnecessary.


129028 07-May-2004 green

In r1.190, vslock() and vsunlock() were bogusly made to do a "user wire"
and a "system unwire." Make this a "system wire" and "system unwire."

Reviewed by: alc


129018 07-May-2004 green

Properly remove MAP_FUTUREWIRE when a vm_map_entry gets torn down.
Previously, mlockall(2) usage would leak MAP_FUTUREWIRE of the process's
vmspace::vm_map and subsequent processes would wire all of their memory.
Coupled with a wired-page leak in vm_fault_unwire(), this would run the
system out of free pages and cause programs to randomly SIGBUS when
faulting in new pages.

(Note that this is not the fix for the latter part; pages are still
leaked when a wired area is unmapped in some cases.)

Reviewed by: alc
PR: kern/62930


128992 06-May-2004 alc

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


128633 25-Apr-2004 alc

Zero the physical page only if it is invalid and not prezeroed.


128620 24-Apr-2004 alc

Add a VM_OBJECT_LOCK_ASSERT() call. Remove splvm() and splx() calls. Move
a comment.


128614 24-Apr-2004 alc

Update the comment describing vm_page_grab() to reflect the previous
revision and correct some of its style errors.


128613 24-Apr-2004 alc

Push down the responsibility for zeroing a physical page from the
caller to vm_page_grab(). Although this gives VM_ALLOC_ZERO a
different meaning for vm_page_grab() than for vm_page_alloc(), I feel
such change is necessary to accomplish other goals. Specifically, I
want to make the PG_ZERO flag immutable between the time it is
allocated by vm_page_alloc() and freed by vm_page_free() or
vm_page_free_zero() to avoid locking overheads. Once we gave up on
the ability to automatically recognize a zeroed page upon entry to
vm_page_free(), the ability to mutate the PG_ZERO flag became useless.
Instead, I would like to say that "Once a page becomes valid, its
PG_ZERO flag must be ignored."


128596 24-Apr-2004 alc

In cases where a file was resident in memory mmap(..., PROT_NONE, ...)
would actually map the file with read access enabled. According to
http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html this is
an error. Similarly, an madvise(..., MADV_WILLNEED) would enable read
access on a virtual address range that was PROT_NONE.

The solution implemented herein is (1) to pass a vm_prot_t to
vm_map_pmap_enter() describing the allowed access and (2) to make
vm_map_pmap_enter() responsible for understanding the limitations of
pmap_enter_quick().

Submitted by: "Mark W. Krentel" <krentel@dreamscape.com>
PR: kern/64573


128570 23-Apr-2004 alc

Push down Giant into vm_pager_get_pages(). The only get pages methods that
require Giant are in the device and vnode pagers.


128097 10-Apr-2004 alc

- pmap_kenter_temporary() is unused by machine-independent code. Therefore,
move its declaration to the machine-dependent header file on those
machines that use it. In principle, only i386 should have it.
Alpha and AMD64 should use their direct virtual-to-physical mapping.
- Remove pmap_kenter_temporary() from ia64. It is unused. Approved
by: marcel@


128038 08-Apr-2004 alc

The demise of vm_pager_map_page() in revision 1.93 of vm/vm_pager.c permits
the reduction of the pager map's size by 8M bytes. In other words, eight
megabytes of largely wasted KVA are returned to the kernel map for use
elsewhere.


127961 06-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


127926 06-Apr-2004 alc

Eliminate vm_pager_map_page() and vm_pager_unmap_page() and their uses.
Use sf_buf_alloc() and sf_buf_free() instead.


127879 05-Apr-2004 kan

Delay permission checks for VCHR vnodes until after vnode is locked in
vm_mmap_vnode function, where we can safely check for a special /dev/zero
case. Rev. 1.180 has reordered checks and introduced a regression.

Submitted by: alc
Was broken by: kan


127869 05-Apr-2004 alc

Remove unused arguments from pmap_init().


127868 04-Apr-2004 alc

Eliminate unused arguments from vm_page_startup().


127327 23-Mar-2004 tjr

Do not copy vm_exitingcnt to the new vmspace in vmspace_exec(). Copying
it led to impossibly high values in the new vmspace, causing it to never
drop to 0 and be freed.


127187 18-Mar-2004 guido

When mmap-ing a file from a noexec mount, be sure not to grant the right
to mmap it PROT_EXEC. This also depends on the architecture, as some
architectures (e.g. i386) do not distinguish between read and exec pages.

Inspired by: http://linux.bkbits.net:8080/linux-2.4/cset@1.1267.1.85
Reviewed by: alc


127013 15-Mar-2004 truckman

Make overflow/wraparound checking more robust and unbreak len=0 in
vslock(), mlock(), and munlock().

Reviewed by: bde
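
A wraparound check of this kind can be sketched as follows. This is a
minimal sketch, assuming a 4096-byte page; toy_wire_range() and
TOY_PAGE_SIZE are hypothetical names, and the code is not the text of the
change itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Compute the page-aligned range [*startp, *endp) that covers
 * addr .. addr + len.  Returns 0 on success, or EINVAL if either the
 * byte range or the rounded-up end wraps around the address space.
 * A zero length is accepted and yields a valid (possibly empty) range.
 */
static int
toy_wire_range(uintptr_t addr, size_t len, uintptr_t *startp, uintptr_t *endp)
{
	const uintptr_t mask = TOY_PAGE_SIZE - 1;
	uintptr_t last, end;

	assert(startp != NULL && endp != NULL);
	last = addr + len;		/* unsigned: wraps instead of overflowing */
	if (last < addr)
		return (EINVAL);
	end = (last + mask) & ~mask;	/* round up; may itself wrap */
	if (end < last)
		return (EINVAL);
	*startp = addr & ~mask;		/* truncate to a page boundary */
	*endp = end;
	return (0);
}
```

Checking the rounded end as well as the raw sum matters because a range
that ends in the last partial page of the address space only wraps once it
is rounded up.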


127008 15-Mar-2004 truckman

Style(9) changes.

Pointed out by: bde


127007 15-Mar-2004 truckman

Revert to the original vslock() and vsunlock() API with the following
exceptions:
Retain the recently added vslock() error return.

The type of the len argument should be size_t, not u_int.

Suggested by: bde


127006 15-Mar-2004 truckman

Remove redundant suser() check.


126911 13-Mar-2004 alc

Remove GIANT_REQUIRED from contigfree().


126865 12-Mar-2004 peter

Part 2 of rev 1.68. Update comment to match reality now that vm_endcopy
exists and we no longer copy to the end of the struct.

Forgotten by: alfred and green


126793 10-Mar-2004 alc

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


126739 08-Mar-2004 alc

Implement a work around for the deadlock avoidance case in
vm_object_deallocate() so that it doesn't spin forever either.

Submitted by: bde


126728 07-Mar-2004 alc

Retire pmap_pinit2(). Alpha was the last platform that used it. However,
ever since alpha/alpha/pmap.c revision 1.81 introduced the list allpmaps,
there has been no reason for having this function on Alpha. Briefly,
when pmap_growkernel() relied upon the list of all processes to find and
update the various pmaps to reflect a growth in the kernel's valid
address space, pmap_pinit2() served to avoid a race between pmap
initialization and pmap_growkernel(). Specifically, pmap_pinit2() was
responsible for initializing the kernel portions of the pmap and
pmap_pinit2() was called after the process structure contained a pointer
to the new pmap for use by pmap_growkernel(). Thus, an update to the
kernel's address space might be applied to the new pmap unnecessarily,
but an update would never be lost.


126714 07-Mar-2004 rwatson

Mark uma_callout as CALLOUT_MPSAFE, as uma_timeout can run MPSAFE.

Reviewed by: jeff


126668 05-Mar-2004 truckman

Undo the merger of mlock()/vslock and munlock()/vsunlock() and the
introduction of kern_mlock() and kern_munlock() in
src/sys/kern/kern_sysctl.c 1.150
src/sys/vm/vm_extern.h 1.69
src/sys/vm/vm_glue.c 1.190
src/sys/vm/vm_mmap.c 1.179
because different resource limits are appropriate for transient and
"permanent" page wiring requests.

Retain the kern_mlock() and kern_munlock() API in the revived
vslock() and vsunlock() functions.

Combine the best parts of each of the original sets of implementations
with further code cleanup. Make the mlock() and vslock()
implementations as similar as possible.

Retain the RLIMIT_MEMLOCK check in mlock(). Move the most stringent
test, which can return EAGAIN, last so that requests that have no
hope of ever being satisfied will not be retried unnecessarily.

Disable the test that can return EAGAIN in the vslock() implementation
because it will cause the sysctl code to wedge.

Tested by: Cy Schubert <Cy.Schubert AT komquats.com>


126632 05-Mar-2004 alc

In the last revision, I introduced a physical contiguity check that is both
unnecessary and wrong. While it is necessary to verify that the page is
still free after dropping and reacquiring the free page queue lock, the
physical contiguity of the page can not change, making this check
unnecessary. This check was wrong in that it could cause an out-of-bounds
array access.

Tested by: rwatson


126588 04-Mar-2004 bde

Record exactly where this file was copied from. It wasn't repo-copied so
this is not very obvious.

Fixed some style bugs (mainly missing parentheses around return values).


126585 04-Mar-2004 bde

Minor style fixes. In vm_daemon(), don't fetch the rss limit long before
it is needed.


126571 04-Mar-2004 alc

Remove some long unused definitions.


126479 02-Mar-2004 alc

Modify contigmalloc1() so that the free page queues lock is not held when
vm_page_free() is called. The problem with holding this lock is that it is
a spin lock and vm_page_free() may attempt the acquisition of a different
default-type lock.


126424 01-Mar-2004 kan

Pick up a do {} while(0) cleanup by phk that was discarded accidentally in
previous revision.

Submitted by: alc


126332 27-Feb-2004 kan

Move the code dealing with vnode out of several functions into a single
helper function vm_mmap_vnode.

Discussed with: jeffr,alc (a while ago)


126253 26-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


126135 23-Feb-2004 alc

- Substitute bdone() and bwait() from vfs_bio.c for
swap_pager_putpages()'s buffer completion code. Note: the only
difference between swp_pager_sync_iodone() and bdone(), aside from
the locking in the latter, was the unnecessary clearing of B_ASYNC.
- Remove an unnecessary pmap_page_protect() from
swp_pager_async_iodone().

Reviewed by: tegge


126108 22-Feb-2004 alc

Correct a long-standing race condition in vm_object_page_remove() that
could result in a dirty page being unintentionally freed.

Reviewed by: tegge
MFC after: 7 days


126088 21-Feb-2004 alc

Eliminate the second, unnecessary call to pmap_page_protect() near the end
of vm_pageout_flush(). Instead, assert that the page is still write
protected.

Discussed with: tegge


125990 19-Feb-2004 alc

- Correct a long-standing race condition in vm_page_try_to_free() that
could result in a dirty page being unintentionally freed.
- Simplify the dirty page check in vm_page_dontneed().

Reviewed by: tegge
MFC after: 7 days


125889 16-Feb-2004 des

Back out previous commit due to objections.


125882 16-Feb-2004 des

Don't panic if we fail to satisfy an M_WAITOK request; return 0 instead.
The calling code will either handle that gracefully or cause a page fault.


125861 16-Feb-2004 alc

Correct a long-standing race condition in vm_contig_launder() that could
result in a panic "vm_page_cache: caching a dirty page, ...": Access to the
page must be restricted or removed before calling vm_page_cache(). This
race condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.

MFC after: 7 days


125838 15-Feb-2004 alc

Correct a long-standing race condition in vm_fault() that could result in a
panic "vm_page_cache: caching a dirty page, ...": Access to the page must
be restricted or removed before calling vm_page_cache(). This race
condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.

Reviewed by: tegge
MFC after: 7 days


125798 14-Feb-2004 alc

- Correct a long-standing race condition in vm_page_try_to_cache() that
could result in a panic "vm_page_cache: caching a dirty page, ...":
Access to the page must be restricted or removed before calling
vm_page_cache(). This race condition is identical in nature to that
which was addressed by vm_pageout.c's revision 1.251.
- Simplify the code surrounding the fix to this same race condition
in vm_pageout.c's revision 1.251. There should be no behavioral
change. Reviewed by: tegge

MFC after: 7 days


125755 12-Feb-2004 phk

Remove the absolute count g_access_abs() function since experience has
shown that it is not useful.

Rename the relative count g_access_rel() function to g_access(), only
the name has changed.

Change all g_access_rel() calls in our CVS tree to call g_access() instead.

Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source
code compatibility.


125748 12-Feb-2004 alc

Further reduce the use of Giant in vm_map_delete(): Perform pmap_remove()
on system maps, besides the kmem_map, without Giant.

In collaboration with: tegge


125662 10-Feb-2004 alc

Correct a long-standing race condition in the inactive queue scan. (See
the added comment for low-level details.) The effect of this race
condition is a panic "vm_page_cache: caching a dirty page, ..."

Reviewed by: tegge
MFC after: 7 days


125558 07-Feb-2004 alc

swp_pager_async_iodone() no longer requires Giant. Modify bufdone()
and swapgeom_done() to perform swp_pager_async_iodone() without Giant.

Reviewed by: tegge


125470 05-Feb-2004 alc

- Locking for the per-process resource limits structure has eliminated
the need for Giant in vm_map_growstack().
- Use the proc * that is passed to vm_map_growstack() rather than
curthread->td_proc.


125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that it is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't use the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


125362 02-Feb-2004 jhb

Drop the reference count on the old vmspace after fully switching the
current thread to the new vmspace.

Suggested by: dillon


125322 02-Feb-2004 phk

Check error return from g_clone_bio(). (netchild@)

Add XXX comment about why this is still not optimal. (phk@)

Submitted by: netchild@


125314 02-Feb-2004 jeff

- Use a separate startup function for the zeroidle kthread. Use this to
set P_NOLOAD prior to running the thread.


125294 01-Feb-2004 jeff

- Fix a problem where we did not drain the cache of buckets in the zone
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.


125246 30-Jan-2004 des

Mechanical whitespace cleanup.


125193 29-Jan-2004 bde

Fixed breakage of scheduling in rev.1.29 of subr_4bsd.c. The
"scheduler" here has very little to do with scheduling. It is actually
the swapper, and it really must be the last SYSINIT'ed item like its
comment says, since proc0 metamorphoses into swapper by calling
scheduler() last in mi_start(), and scheduler() never returns. Rev.1.29
of subr_4bsd.c broke this by adding another SI_ORDER_FIRST item
(kproc_start() for schedcpu_thread()) onto the SI_SUB_RUN_SCHEDULER_LIST.
The sorting of SYSINITs with identical orders (at all levels) is
apparently nondeterministic, so this resulted in scheduler() sometimes
being called second last and schedcpu_thread() not being called at all.

This quick fix just changes the code to almost match the comment
(SI_ORDER_FIRST -> SI_ORDER_ANY). "LAST" is misspelled "ANY", and
there is no way to ensure that there is only 1 very last SYSINIT.
A more complete fix would remove the SYSINIT obfuscation.
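
The ordering problem can be sketched with a toy table. This is a minimal
sketch, assuming SYSINITs are sorted by (subsystem, order); the
toy_sysinit structure, the TOY_ORDER_* values, and the use of qsort(3) in
place of the kernel's own sort are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative order values; only their relative magnitude matters here. */
#define TOY_ORDER_FIRST	0x0000000
#define TOY_ORDER_ANY	0xfffffff

struct toy_sysinit {
	int subsystem;		/* stands in for an SI_SUB_* value */
	int order;		/* stands in for an SI_ORDER_* value */
	const char *name;
};

/* Compare by (subsystem, order); entries equal on both have no defined order. */
static int
toy_sysinit_cmp(const void *a, const void *b)
{
	const struct toy_sysinit *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

/* Sort the table and return the name of the entry that would run last. */
static const char *
toy_last_sysinit(struct toy_sysinit *tab, size_t n)
{
	assert(tab != NULL && n > 0);
	qsort(tab, n, sizeof(tab[0]), toy_sysinit_cmp);
	return (tab[n - 1].name);
}
```

With one entry at TOY_ORDER_FIRST and another at TOY_ORDER_ANY in the same
subsystem, the latter always sorts last. If both use TOY_ORDER_FIRST, the
comparison returns 0 and the sort is free to place either one last, which
is the situation the message describes.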


124944 25-Jan-2004 jeff

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and properly
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate parameter and
remove direct references to the process statistics.
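
The accounting that moves into mi_switch() can be sketched as follows.
This is a minimal sketch, assuming the two counters are the voluntary and
involuntary context switch counts; toy_mi_switch(), toy_rusage, and the
TOY_SW_* values are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SW_VOL	0x1	/* voluntary switch; value is illustrative */
#define TOY_SW_INVOL	0x2	/* involuntary switch; value is illustrative */

/* Toy stand-in for the two rusage counters involved. */
struct toy_rusage {
	long nvcsw;	/* voluntary context switches */
	long nivcsw;	/* involuntary context switches */
};

static void
toy_mi_switch(struct toy_rusage *ru, int flags)
{
	assert(ru != NULL);
	/* The caller must say which kind of switch this is. */
	assert((flags & (TOY_SW_VOL | TOY_SW_INVOL)) != 0);
	if (flags & TOY_SW_VOL)
		ru->nvcsw++;
	else
		ru->nivcsw++;
	/* The context switch itself would follow here. */
}
```

Callers then pass a flag instead of adjusting the counter themselves, so
the statistic cannot be forgotten at a call site.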


124933 24-Jan-2004 alc

1. Statically initialize swap_pager_full and swap_pager_almost_full to the
full state. (When swap is added their state will change appropriately.)
2. Set swap_pager_full and swap_pager_almost_full to the full state when
the last swap device is removed.
Combined, these changes eliminate nonsense messages from the kernel on swap-
less machines.

Item 2 submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
Prodding by: phk


124649 18-Jan-2004 alc

Increase UMA_BOOT_PAGES because of changes to pv entry initialization in
revision 1.457 of i386/i386/pmap.c.


124646 18-Jan-2004 alc

Don't acquire Giant in vm_object_deallocate() unless the object is vnode-
backed.


124513 14-Jan-2004 alc

Remove vm_page_alloc_contig(). It's now unused.


124366 11-Jan-2004 alc

Remove long dead code, specifically, code related to munmapfd().
(See also vm/vm_mmap.c revision 1.173.)


124353 10-Jan-2004 alc

- Unmanage pages allocated by contigmalloc1(). (There is no point in
having PV entries for these pages.)
- Remove splvm() and splx() calls.


124321 10-Jan-2004 alc

Unmanage pages allocated by kmem_alloc(). (There is no point in having PV
entries for these pages.)


124261 08-Jan-2004 alc

- Enable recursive acquisition of the mutex synchronizing access to the
free pages queue. This is presently needed by contigmalloc1().
- Move a sanity check against attempted double allocation of two pages
to the same vm object offset from vm_page_alloc() to vm_page_insert().
This provides better protection because double allocation could occur
through a direct call to vm_page_insert(), such as that by
vm_page_rename().
- Modify contigmalloc1() to hold the mutex synchronizing access to the
free pages queue while it scans vm_page_array in search of free pages.
- Correct a potential leak of pages by contigmalloc1() that I introduced
in revision 1.20: We must convert all cache queue pages to free pages
before we begin removing free pages from the free queue. Otherwise,
if we have to restart the scan because we are unable to acquire the
vm object lock that is necessary to convert a cache queue page to a
free page, we leak those free pages already removed from the free queue.


124195 06-Jan-2004 alc

Don't bother clearing PG_ZERO in contigmalloc1(), kmem_alloc(), or
kmem_malloc(). It serves no purpose.


124133 04-Jan-2004 alc

Simplify the various pager allocation routines by computing the desired
object size once and assigning that value to a local variable.


124117 04-Jan-2004 alc

Eliminate the acquisition and release of Giant from vnode_pager_alloc().
The vm object and vnode locking should suffice.

Discussed with: jeff


124110 03-Jan-2004 alc

Reduce the scope of Giant in swap_pager_alloc().


124084 02-Jan-2004 alc

Revision 1.74 of vm_meter.c ("Avoid lock-order reversal") makes the release
and subsequent reacquisition of the same vm object lock in
vm_object_collapse() unnecessary.


124083 02-Jan-2004 alc

Avoid lock-order reversal between the vm object list mutex and the vm
object mutex.


124048 01-Jan-2004 alc

- Increase the scope of the kmem_object's lock in kmem_malloc(). Add a
comment explaining why a further increase is not possible.


124028 31-Dec-2003 alc

In vm_page_lookup() check the root of the vm object's splay tree for the
desired page before calling vm_page_splay().


124012 31-Dec-2003 alc

Simplify vm_page_grab(): Don't bother with the generation check. If the
vm object hasn't changed, the desired page will be at or near the root
of the vm object's splay tree, making vm_page_lookup() cheap. (The only
lock required for vm_page_lookup() is already held.) If, however, the
vm object has changed and retry was requested, eliminating the generation
check also eliminates a pointless acquisition and release of the page
queues lock.


124008 30-Dec-2003 alc

- Modify vm_object_split() to expect a locked vm object on entry and
return with a locked vm object on exit. Remove GIANT_REQUIRED.
- Eliminate some unnecessary local variables from vm_object_split().


123948 29-Dec-2003 alc

Remove swap_pager_un_object_list; it is unused.


123914 28-Dec-2003 alc

Remove GIANT_REQUIRED from kmem_suballoc().


123879 26-Dec-2003 alc

- Reduce Giant's scope in vm_fault().
- Use vm_object_reference_locked() instead of vm_object_reference()
in vm_fault().


123878 26-Dec-2003 alc

Minor correction to revision 1.258: Use the proc pointer that is passed to
vm_map_growstack() in the RLIMIT_VMEM check rather than curthread.


123711 22-Dec-2003 alc

- Create an unmapped guard page to trap access to vm_page_array[-1].
This guard page would have trapped the problems with the MFC of the PAE
support to RELENG_4 at an earlier point in the sequence of events.

Submitted by: tegge


123710 22-Dec-2003 alc

- Significantly reduce the number of preallocated pv entries in
pmap_init(). Such a large preallocation is unnecessary and wastes
nearly eight megabytes of kernel virtual address space per gigabyte
of managed physical memory.
- Increase UMA_BOOT_PAGES by two. This enables the removal of
pmap_pv_allocf(). (Note: this function was only used during
initialization, specifically, after pmap_init() but before
pmap_init2(). During pmap_init2(), a new allocator is installed.)


123697 21-Dec-2003 alc

- Correct an error in mincore(2) that has existed since its introduction:
mincore(2) should check that the page is valid, not just allocated.
Otherwise, it can return a false positive for a page that is not yet
resident because it is being read from disk.


123280 08-Dec-2003 kan

Remove trailing whitespace.


123276 08-Dec-2003 alc

Addendum to revision 1.174: In the case where vm_pager_allocate() is called
to create a vnode-backed object, the vnode lock must be held by the caller.

Reported by: truckman
Discussed with: kan


123168 06-Dec-2003 alc

Fix a deadlock between vm_fault() and vm_mmap(): The expected lock ordering
between vm_map and vnode locks is that vm_map locks are acquired first. In
revision 1.150 mmap(2) was changed to pass a locked vnode into vm_mmap().
This creates a lock-order reversal when vm_mmap() calls one of the vm_map
routines that acquires a vm_map lock. The solution implemented herein is
to release the vnode lock in mmap() before calling vm_mmap() and reacquire
this lock if necessary in vm_mmap().

Approved by: re (scottl)
Reviewed by: jeff, kan, rwatson


123126 03-Dec-2003 jhb

Fix all users of mp_maxid to use the same semantics, namely:

1) mp_maxid is a valid FreeBSD CPU ID in the range 0 .. MAXCPU - 1.
2) For all active CPUs in the system, PCPU_GET(cpuid) <= mp_maxid.

Approved by: re (scottl)
Tested on: i386, amd64, alpha
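
Under these semantics, a loop over per-CPU data uses an inclusive upper
bound. This is a minimal sketch, assuming a per-CPU array sized by the
compile-time maximum; toy_mp_maxid, toy_percpu, and TOY_MAXCPU are
hypothetical names chosen for illustration.

```c
#include <assert.h>

#define TOY_MAXCPU 4	/* illustrative; the real MAXCPU is platform-defined */

static int toy_mp_maxid = TOY_MAXCPU - 1;	/* highest valid CPU ID */
static long toy_percpu[TOY_MAXCPU];		/* one slot per possible CPU ID */

/* Sum the per-CPU counters for every CPU ID up to and including mp_maxid. */
static long
toy_percpu_sum(void)
{
	long sum = 0;
	int i;

	/* Rule 1: mp_maxid is a valid ID in the range 0 .. MAXCPU - 1. */
	assert(toy_mp_maxid >= 0 && toy_mp_maxid <= TOY_MAXCPU - 1);
	/* mp_maxid is itself a valid ID, so the bound is "<=", not "<". */
	for (i = 0; i <= toy_mp_maxid; i++)
		sum += toy_percpu[i];
	return (sum);
}
```

Because the largest possible mp_maxid is MAXCPU - 1, an array of MAXCPU
entries is never indexed out of bounds by this loop.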


123073 30-Nov-2003 jeff

- Unbreak UP. mp_maxid is not defined on uni-processor machines, although
I believe it and the other MP variables should be. For now, just define it
here and wait for jhb to clean it up later.

Approved by: re (rwatson)


123057 30-Nov-2003 jeff

- Replace the local maxcpu with mp_maxid. Previously, if mp_maxid
was equal to MAXCPU, we would overrun the pcpu_mtx array because maxcpu
was calculated incorrectly.
- Add some more debugging code so that memory leaks at the time of
uma_zdestroy() are more easily diagnosed.

Approved by: re (rwatson)


122902 19-Nov-2003 alc

- Avoid a lock-order reversal between Giant and a system map mutex that
occurs when kmem_malloc() fails to allocate a sufficient number of vm
pages. Specifically, we avoid the lock-order reversal by not grabbing
Giant around pmap_remove() if the map is the kmem_map.

Approved by: re (jhb)
Reported by: Eugene <eugene3@web.de>


122748 15-Nov-2003 tjr

In vnode_pager_input_smlfs(), call VOP_STRATEGY instead of VOP_SPECSTRATEGY
on non-VCHR vnodes. This fixes a panic when reading data from files on a
filesystem with a small (less than a page) block size.

PR: 59271
Reviewed by: alc


122680 14-Nov-2003 alc

- Remove use of Giant from uma_zone_set_obj().


122651 14-Nov-2003 alc

- Remove long dead code.


122646 14-Nov-2003 alc

Changes to msync(2)
- Return EBUSY if the region was wired by mlock(2) and MS_INVALIDATE
is specified to msync(2). This is required by the Open Group Base
Specifications Issue 6.
- vm_map_sync() doesn't return KERN_FAILURE. Thus, msync(2) can't
possibly return EIO.
- The second major loop in vm_map_sync() handles sub maps. Thus,
failing on sub maps in the first major loop isn't necessary.


122384 10-Nov-2003 alc

- The Open Group Base Specifications Issue 6 specifies that an munmap(2)
must return EINVAL if size is zero. Submitted by: tegge
- In order to avoid a race condition in multithreaded applications, the
check and removal operations by munmap(2) must be in the same critical
section. To accommodate this, vm_map_check_protection() is modified to
require its caller to obtain at least a read lock on the map.
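
The zero-length case can be observed from userland. This is a minimal
sketch, assuming a system that follows the specification cited above;
zero_length_munmap_errno() is a hypothetical helper written for
illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one anonymous page, attempt a zero-length munmap(2) on it, and
 * return the resulting errno (0 if the call succeeded, -1 if the
 * initial mmap(2) failed).
 */
static int
zero_length_munmap_errno(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	void *p;
	int error = 0;

	assert(pagesz > 0);
	p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return (-1);
	if (munmap(p, 0) == -1)
		error = errno;
	/* Release the mapping for real. */
	(void)munmap(p, (size_t)pagesz);
	return (error);
}
```

On a conforming system the helper is expected to return EINVAL.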


122383 10-Nov-2003 mini

NFC: Update stale comments.

Reviewed by: alc


122367 09-Nov-2003 alc

- Remove Giant from msync(2). Giant is still acquired by the lower layers
if we drop into the pmap or vnode layers.
- Migrate the handling of zero-length msync(2)s into vm_map_sync() so that
multithreaded applications can't change the map between implementing the
zero-length hack in msync(2) and reacquiring the map lock in
vm_map_sync().

Reviewed by: tegge


122349 09-Nov-2003 alc

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same time, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


122095 05-Nov-2003 alc

- Move the implementation of OBJ_ONEMAPPING from vm_map_delete() to
vm_map_entry_delete() so that all of the vm object manipulation is
performed in one place.


122034 04-Nov-2003 marcel

Update avail_ssize for rstacks after growing them.


121962 03-Nov-2003 des

Whitespace cleanup.


121919 03-Nov-2003 alc

- Increase the scope of the source object lock in vm_map_copy_entry().


121913 02-Nov-2003 alc

- Increase the scope of two vm object locks in vm_object_split().


121907 02-Nov-2003 alc

- Introduce and use vm_object_reference_locked(). Unlike
vm_object_reference(), this function must not be used to reanimate dead
vm objects. This restriction simplifies locking.

Reviewed by: tegge


121866 01-Nov-2003 alc

- Increase the scope of two vm object locks in vm_object_collapse().
- Remove the acquisition and release of Giant from vm_object_coalesce().


121854 01-Nov-2003 alc

- Modify swap_pager_copy() and its callers such that the source and
destination objects are locked on entry and exit. Add comments to
the callers noting that the locks can be released by swap_pager_copy().
- Remove several instances of GIANT_REQUIRED.


121844 01-Nov-2003 alc

- Additional vm object locking in vm_object_split()
- New vm object locking assertions in vm_page_insert() and
vm_object_set_writeable_dirty()


121821 31-Oct-2003 alc

- Revert a part of revision 1.73: Make vm_object_set_flag() an inline
function. This function is so trivial that inlining reduces the size
of the kernel.


121815 31-Oct-2003 alc

- Take advantage of the swap pager locking: Eliminate the use of Giant
from vm_object_madvise().
- Remove excessive blank lines from vm_object_madvise().


121786 31-Oct-2003 marcel

Fix two bugs introduced with the rstack functionality and specific to
the rstack functionality:
1. Fix a KASSERT that tests for the address to be above the upward
growable stack. Typically for rstack, the faulting address can be
identical to the record end of the upward growable entry, and
very likely is on ia64. The KASSERT tested for greater than, not
greater than or equal, so whenever the register stack had to be grown
the assertion fired.
2. When we grow the upward growable stack entry and adjust the
underlying object, don't forget to adjust the size of the VM map.
Not doing so would trigger an assert in vm_mapzdtor().

Pointy hat: marcel (for not testing with INVARIANTS).


121782 31-Oct-2003 alc

- Synchronize access to the swdevt's sw_flags with sw_dev_mtx.
- Remove several instances of GIANT_REQUIRED.


121727 30-Oct-2003 alc

- Synchronize access to the swdevt's sw_blist with sw_dev_mtx.
- Remove several instances of GIANT_REQUIRED.


121725 30-Oct-2003 alc

- Synchronize access to swdevhd using sw_dev_mtx.
- Use swp_sizecheck() rather than assignment to swap_pager_full in
swaponsomething().


121649 29-Oct-2003 alc

- Synchronize updates to nswapdev using sw_dev_mtx.


121646 29-Oct-2003 alc

- Avoid a race in swaponsomething(): Calculate the new swdevt's first and
end swblk and insert this new swdevt into the list of swap devices
in the same critical section.


121601 27-Oct-2003 alc

- Complete the synchronization of accesses to the swblock hash table.


121583 26-Oct-2003 alc

- Introduce and use a mutex synchronizing access to the swblock hash table.


121562 26-Oct-2003 alc

- Simplify vm_object_collapse()'s collapse case, reducing the number
of lock acquires and releases performed.
- Move an assertion from vm_object_collapse() to vm_object_zdtor()
because it applies to all cases of object destruction.


121517 25-Oct-2003 alc

- Add some of the required vm object locking, including assertions where
the vm object lock is required and already held.


121511 25-Oct-2003 alc

- Align a comment within struct vm_page.
- Annotate the vm_page's valid field as synchronized by the containing
vm object's lock.


121495 25-Oct-2003 alc

- Call vnode_pager_input_old() with the vm object locked.


121455 24-Oct-2003 alc

- Push down Giant from vm_pageout() to vm_pageout_scan(), freeing
vm_pageout_page_stats() from Giant.
- Modify vm_pager_put_pages() and vm_pager_page_unswapped() to expect the
vm object to be locked on entry. (All of the pager routines now expect
this.)


121351 22-Oct-2003 alc

- Retire vm_pageout_page_free(). Instead, use vm_page_select_cache() from
vm_pageout_scan(). Rationale: I don't like leaving a busy page in the
cache queue with neither the vm object nor the vm page queues lock held.
- Assert that the page is active in vm_pageout_page_stats().


121321 22-Oct-2003 alc

- Assert that every page found in the active queue is an active page.


121313 21-Oct-2003 alc

- Assert that the containing vm object is locked in
vm_page_set_validclean(). (This function reads and modifies the
vm page's valid field, which is synchronized by the lock on the
containing vm object.)


121288 20-Oct-2003 alc

- Remove some long unused code.


121267 20-Oct-2003 alc

- Remove comments referring to functions that no longer exist.


121264 20-Oct-2003 alc

- Hold the vm object's lock around calls to vm_page_set_validclean().


121230 19-Oct-2003 alc

- Synchronize access to a vm page's valid field using the containing
vm object's lock.
- Reduce the scope of the vm page queues lock in two places.


121227 18-Oct-2003 alc

- Synchronize access to the page's valid field in
vnode_pager_generic_getpages() using the containing object's lock.


121226 18-Oct-2003 alc

- Increase the object lock's scope in vm_contig_launder() so that access
to the object's type field and the call to vm_pageout_flush() are
synchronized.
- The above change allows for the elimination of the last parameter
to vm_pageout_flush().
- Synchronize access to the page's valid field in vm_pageout_flush()
using the containing object's lock.


121221 18-Oct-2003 alc

Corrections to revision 1.305
- Specifying VM_MAP_WIRE_HOLESOK should not assume that the start
address is the beginning of the map. Instead, move to the first
entry after the start address.
- The implementation of VM_MAP_WIRE_HOLESOK was incomplete. This
caused the failure of mlockall(2) in some circumstances.


121205 18-Oct-2003 phk

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


121199 18-Oct-2003 phk

Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY().
Remove stale comment about B_PHYS.


121150 17-Oct-2003 alc

- Synchronize access to a vm page's valid field using the containing
vm object's lock.
- Release the vm object and vm page queues locks around vput().


121108 15-Oct-2003 alc

- vm_fault_copy_entry() should not assume that the source object contains
every page. If the source entry was read-only, one or more wired pages
could be in backing objects.
- vm_fault_copy_entry() should not set the PG_WRITEABLE flag on the page
unless the destination entry is, in fact, writeable.


120905 08-Oct-2003 alc

Lock the destination object in vm_fault_copy_entry().


120903 08-Oct-2003 alc

Retire vm_page_copy(). Its reason for being ended when peter@ modified
pmap_copy_page() et al. to accept a vm_page_t rather than a physical
address. Also, this change will facilitate locking access to the vm page's
valid field.


120837 06-Oct-2003 bms

Only the super-user should be able to wire pages via the mlock() family
of system calls at this time. Remove various #ifdef's to enforce this.


120831 06-Oct-2003 bms

Move pmap_resident_count() from the MD pmap.h to the MI pmap.h.
Add a definition of pmap_wired_count().
Add a definition of vmspace_wired_count().

Reviewed by: truckman
Discussed with: peter


120824 05-Oct-2003 alc

The addition of a locking assertion to vm_page_zero_invalid() has revealed
a long-time bug: vm_pager_get_pages() assumes that m[reqpage] contains a
valid page upon return from pgo_getpages(). In the case of the device
pager this page has been freed and replaced by a fake page. The fake page
is properly inserted into the vm object but m[reqpage] is left pointing
to a freed page. For now, update m[reqpage] to point to the fake page.

Submitted by: tegge


120811 05-Oct-2003 bms

Revert previous commit. Come back vslock(), all is forgiven.

Pointy hat to: bms


120806 05-Oct-2003 bms

Retire vslock() and vsunlock() with extreme prejudice.

Discussed with: pete


120790 05-Oct-2003 alc

Assert that the containing vm object's lock is held in
vm_page_set_invalid().


120766 04-Oct-2003 alc

Assert that the containing vm object's lock is held in
vm_page_zero_invalid().


120764 04-Oct-2003 alc

Synchronize access to a vm page's valid field using the containing
vm object's lock.


120762 04-Oct-2003 alc

- Extend the scope of the vm object lock to cover calls to
vm_page_is_valid().
- Assert that the lock on the containing vm object is held in
vm_page_is_valid().


120761 04-Oct-2003 alc

Synchronize access to a vm page's valid field using the containing
vm object's lock.


120739 04-Oct-2003 jeff

- Use the UMA_ZONE_VM flag on the fakepg and object zones to prevent
vm recursion and LORs. This may be necessary for other zones created in
the vm but this needs to be verified.


120722 03-Oct-2003 alc

Migrate pmap_prefault() into the machine-independent virtual memory layer.

A small helper function pmap_is_prefaultable() is added. This function
encapsulates the few lines of pmap_prefault() that actually vary from
machine to machine. Note: pmap_is_prefaultable() and pmap_mincore() have
much in common. Going forward, it's worth considering their merger.


120538 28-Sep-2003 alc

In vm_page_remove(), assert that the vm object is locked, except on Alpha.
(The Alpha still requires updates to its pmap.)


120531 27-Sep-2003 marcel

Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. We need to include or cover the faulting
address.
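
The boundary condition described above can be sketched as follows. This is
an illustration under stated assumptions, not the code in vm_map_growstack():
PAGE_SZ, round_page_up() and grow_up_amount() are hypothetical names, a 4 KB
page is assumed, and "end" is taken to be the page-aligned, exclusive end of
the upward-growing entry.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u	/* assumed page size for the sketch */

/* Round x up to the next multiple of PAGE_SZ. */
static uintptr_t
round_page_up(uintptr_t x)
{
	return ((x + PAGE_SZ - 1) & ~(uintptr_t)(PAGE_SZ - 1));
}

/*
 * Bytes by which an upward-growing entry ending at "end" (exclusive)
 * must grow so that the faulting address "addr" itself becomes mapped.
 */
static uintptr_t
grow_up_amount(uintptr_t end, uintptr_t addr)
{
	/* ">=", not ">": a fault exactly at the end is legitimate. */
	assert(addr >= end);
	/* "+ 1" so the byte at addr is covered, not merely reached. */
	return (round_page_up(addr + 1 - end));
}
```

With these definitions a fault exactly at the entry's end still grows the
entry by one full page, which is the case the message describes.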

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64


120526 27-Sep-2003 phk

Provide a bit more help with "memory overwritten after free" style bugs.


120422 25-Sep-2003 peter

Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 hierarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.


120389 23-Sep-2003 silby

Adjust the kmapentzone limit so that it takes into account the size of
maxproc and maxfiles, as procs, pipes, and other structures cause allocations
from kmapentzone.

Submitted by: tegge


120371 23-Sep-2003 alc

Change the handling of the kernel and kmem objects in vm_map_delete(): In
order to use "unmanaged" pages in the kmem object, vm_map_delete() must
unconditionally perform pmap_remove(). Otherwise, sparc64 has problems.

Tested by: jake


120326 22-Sep-2003 alc

Initialize the page's pindex field even for VM_ALLOC_NOOBJ allocations.
(This field is useful for implementing sanity checks even if the page does
not belong to an object.)


120311 21-Sep-2003 jeff

- Fix MD_SMALL_ALLOC on architectures that support it. Define a new alloc
function, startup_alloc(), that is used for single page allocations prior
to the VM starting up. If it is used after the VM starts up, it
replaces the zone's allocf pointer with either page_alloc() or
uma_small_alloc() where appropriate.

Pointy hat to: me
Tested by: phk/amd64, me/x86


120305 20-Sep-2003 peter

Bad Jeffr! No cookie!

Temporarily disable the UMA_MD_SMALL_ALLOC stuff since recent commits
break sparc64, amd64, ia64 and alpha. It appears only i386 and maybe
powerpc were not broken.


120262 19-Sep-2003 jeff

- Remove the working-set algorithm. Instead, use the per cpu buckets as the
working set cache. This has several advantages. Firstly, we never touch
the per cpu queues now in the timeout handler. This removes one more
reason for having per cpu locks. Secondly, it reduces the size of the zone
by 8 bytes, bringing it under 200 bytes for a single proc x86 box. This
tidies up other logic as well.
- The 'destroy' flag no longer needs to be passed to zone_drain() since it
always frees everything in the zone's slabs.
- cache_drain() is now only called from zone_dtor() and so it destroys by
default. It also does not need the destroy parameter now.


120255 19-Sep-2003 jeff

- Remove the cache colorization code. We can't use it due to all of the
broken consumers of the malloc interface who assume that the allocated
address will be an even multiple of the size.
- Remove disabled time delay code on uma_reclaim(). The comment there said
it all. It was not an effective strategy and it should not be left in
#if 0'd for all eternity.


120249 19-Sep-2003 jeff

- There is an endless stream of style(9) errors in this file. Fix a few.
Also catch some spelling errors.


120229 19-Sep-2003 jeff

- Don't inspect the zone in page_alloc(). It may be NULL.
- Don't cache more items than the zone would like in uma_zalloc_bucket().


120224 19-Sep-2003 jeff

- Move the logic for dealing with the uma_boot_pages cache into the
page_alloc() function from the slab_zalloc() function. This allows us
to unconditionally call uz_allocf().
- In page_alloc() cleanup the boot_pages logic some. Previously memory from
this cache that was not used by the time the system started was left in
the cache and never used. Typically this wasn't more than a few pages,
but now we will use this cache so long as memory is available.


120223 19-Sep-2003 jeff

- Fix the silly flag situation in UMA. Remove redundant ZFLAG/ZONE flags
by accepting the user supplied flags directly. Previously this was not
done so that flags for the same field would not be defined in two
different files. Add comments in each header instructing future
developers on how not to shoot their feet.
- Fix a test for !OFFPAGE which should have been a test for HASH. This would
have caused a panic if we had ever destructed a malloc zone. This also
opens up the possibility that other zones could use the vsetobj() method
rather than a hash.


120221 19-Sep-2003 jeff

- Don't abuse M_DEVBUF, define a tag for UMA hashes.


120219 19-Sep-2003 jeff

- Eliminate a pair of unnecessary variables.


120218 19-Sep-2003 jeff

- Initialize a pool of bucket zones so that we waste less space on zones that
don't cache as many items.
- Introduce the bucket_alloc(), bucket_free() functions to wrap bucket
allocation. These functions select the appropriate bucket zone to
allocate from or free to.
- Rename ub_ptr to ub_cnt to reflect a change in its use. ub_cnt now reflects
the count of free items in the bucket. This gets rid of many unnatural
subtractions by 1 throughout the code.
- Add ub_entries which reflects the number of entries possibly held in a
bucket.
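
The ub_cnt/ub_entries convention can be sketched as a small stack. This is
an illustration only: the struct layout, the array name ub_bucket, the
capacity BUCKET_ENTRIES and the helper functions are assumptions, not the
UMA definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUCKET_ENTRIES 4	/* hypothetical capacity */

struct bucket {
	int	ub_cnt;		/* number of free items currently held */
	int	ub_entries;	/* number of items the bucket can hold */
	void	*ub_bucket[BUCKET_ENTRIES];
};

static void
bucket_init(struct bucket *b)
{
	b->ub_cnt = 0;
	b->ub_entries = BUCKET_ENTRIES;
}

/* Cache an item; returns false if the bucket is already full. */
static bool
bucket_push(struct bucket *b, void *item)
{
	if (b->ub_cnt == b->ub_entries)
		return (false);
	b->ub_bucket[b->ub_cnt++] = item;
	return (true);
}

/* Take the most recently cached item, or NULL if the bucket is empty. */
static void *
bucket_pop(struct bucket *b)
{
	if (b->ub_cnt == 0)
		return (NULL);
	return (b->ub_bucket[--b->ub_cnt]);
}
```

Because ub_cnt is a count rather than the index of the last item, the empty
and full tests compare against 0 and ub_entries directly, with no "- 1"
adjustments.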


120217 19-Sep-2003 alc

Merge vm_pageout_free_page_calc() into vm_pageout(), eliminating some
unneeded code.


120183 18-Sep-2003 alc

Add vm object locking to vnode_pager_lock(). (This triggers the movement
of a VM_OBJECT_LOCK() in vm_fault().)


120152 17-Sep-2003 alc

Remove GIANT_REQUIRED from vm_object_shadow().


120150 17-Sep-2003 alc

When calling vget() on a vnode-backed vm object, acquire the vnode
interlock before releasing the vm object's lock.


120086 15-Sep-2003 alc

Eliminate the use of Giant from vm_object_reference().


120050 14-Sep-2003 alc

Call vm_page_unmanage() on pages belonging to the kmem_object. This
eliminates the unnecessary overhead of managing "PV" entries for these
pages.


120035 13-Sep-2003 alc

There is no need for an atomic increment on the vm object's generation
count in _vm_object_allocate(). (Access to the generation count is
governed by the vm object's lock.) Note: the introduction of the
atomic increment in revision 1.238 appears to be an accident. The
purpose of that commit was to fix an Alpha-specific bug in UMA's
debugging code.


119999 12-Sep-2003 alc

Add a new parameter to pmap_extract_and_hold() that is needed to eliminate
Giant from vmapbuf().

Idea from: tegge


119869 08-Sep-2003 alc

Introduce a new pmap function, pmap_extract_and_hold(). This function
atomically extracts and holds the physical page that is associated with the
given pmap and virtual address. Such a function is needed to make the
memory mapping optimizations used by, for example, pipes and raw disk I/O
MP-safe.

Reviewed by: tegge


119858 07-Sep-2003 alc

Revise the locking in mincore(2).


119663 02-Sep-2003 phk

Don't open with exclusive bit, swapon(8) wants to trash our swapdev.

Add XXX comment with a rating of this concept.


119658 01-Sep-2003 eivind

Change clean_map from a global to an auto variable


119596 31-Aug-2003 alc

- Add vm object locking to the part of vm_pageout_scan() that launders
dirty pages.
- Remove some unused variables.


119595 30-Aug-2003 marcel

Introduce MAP_ENTRY_GROWS_DOWN and MAP_ENTRY_GROWS_UP to allow for
growable (stack) entries that not only grow down, but also grow up.
Have vm_map_growstack() take these flags into account when growing
an entry.

This is the first step in adding support for upward growable stacks.
It is a required feature on ia64 to support the register stack (or
rstack as I like to call it -- it also means reverse stack). We do
not currently create rstacks, so the upward growing is not exercised
and the change should be a functional no-op.

Reviewed by: alc


119591 30-Aug-2003 phk

Add a close() method to a swapdev.

Add a GEOM based backend.

Remove the device/VOP_SPECSTRATEGY() based backend.


119590 30-Aug-2003 phk

Protect the swapdevice tailq with a mutex.

Store the udev_t we will report to userland in the swdevt.


119575 30-Aug-2003 phk

Continue the objectification of the swapdev backends:

Remove the vnode and dev_t fields and replace them with a void *.

Introduce separate strategy functions for devices and regular (NFS)
vnodes.

For devices we don't need the vnode v_numoutput stuff.

Add a generic swaponsomething() function to add a swapdevice and
split the remainder of swaponvp() into swaponvp() and swapondev()
which calls this backend.


119574 30-Aug-2003 phk

Make the strategy function a method of the individual swapdev.


119573 30-Aug-2003 phk

Consistently use modern function definitions.


119544 29-Aug-2003 marcel

In vnode_pager_generic_putpages(), change the printf format specifier
to long and explicitly cast field dirty of struct vm_page to unsigned
long. When PAGE_SIZE is 32K, this field is actually unsigned long.


119543 28-Aug-2003 alc

Recent pmap changes permit the use of a more precise locking assertion
in vm_page_lookup().


119468 25-Aug-2003 marcel

Assert that u_long is at least 64 bits if PAGE_SIZE is 32K.

Suggested by: phk


119373 23-Aug-2003 alc

Held pages, just like wired pages, should not be added to the cache queues.

Submitted by: tegge


119370 23-Aug-2003 alc

Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().


119357 23-Aug-2003 alc

To implement the sequential access optimization, vm_fault() may need to
reacquire the "first" object's lock while a backing object's lock is held.
Since this is a lock-order reversal, vm_fault() uses trylock to acquire
the first object's lock, skipping the sequential access optimization in
the unlikely event that the trylock fails.
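
The trylock pattern can be sketched in userland C. This is not the
vm_fault() code: struct obj, the helper names and the readahead_hits counter
are hypothetical, and a C11 atomic_flag stands in for the vm object mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vm object; only the lock matters here. */
struct obj {
	atomic_flag	lock;
	int		readahead_hits;
};

static bool
obj_trylock(struct obj *o)
{
	/* test_and_set returns the previous value: false means we got it. */
	return (!atomic_flag_test_and_set(&o->lock));
}

static void
obj_unlock(struct obj *o)
{
	atomic_flag_clear(&o->lock);
}

/*
 * Called with "backing" locked.  Blocking on "first" here would invert
 * the usual first-then-backing order, so only try for it.
 * Returns true if the optional work was performed.
 */
static bool
try_sequential_optimization(struct obj *first, struct obj *backing)
{
	(void)backing;		/* held by the caller; untouched here */
	if (!obj_trylock(first))
		return (false);	/* contended: skip the optimization */
	first->readahead_hits++;
	obj_unlock(first);
	return (true);
}
```

Since the work guarded by the trylock is only an optimization, skipping it
on contention costs some performance but never correctness, and no thread
ever blocks while holding the locks in the reversed order.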


119356 23-Aug-2003 marcel

Also define VM_PAGE_BITS_ALL for 16K and 32K pages. Make the constant
unsigned for all page sizes and unsigned long for 32K pages.


119354 23-Aug-2003 marcel

Add support for 16K and 32K page sizes. The valid and dirty maps
in struct vm_page are defined as u_int for 16K pages and u_long
for 32K pages, with the implied assumption that long will at least
be 64 bits wide on platforms where we support 32K pages.
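
The field widths follow from keeping one valid/dirty bit per 512-byte disk
block. A minimal sketch of that arithmetic, assuming a 512-byte block size;
BLOCK_SIZE, page_bits() and page_bits_all() are hypothetical names used only
here.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 512u	/* assumed: one valid/dirty bit per 512-byte block */

/* Number of valid/dirty bits needed for a page of the given size. */
static unsigned
page_bits(unsigned page_size)
{
	assert(page_size % BLOCK_SIZE == 0);
	return (page_size / BLOCK_SIZE);
}

/* Mask with every bit set for a page of the given size. */
static uint64_t
page_bits_all(unsigned page_size)
{
	unsigned n = page_bits(page_size);

	assert(n >= 1 && n <= 64);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	return (n == 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1);
}
```

A 16K page needs 32 bits, which fits in a u_int, while a 32K page needs 64
bits, hence the requirement that long be at least 64 bits wide.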


119247 21-Aug-2003 alc

Assert that the vm object's lock is held on entry to vm_page_grab(); remove
code from this function that was needed when vm object locking was
incomplete.


119186 20-Aug-2003 alc

Assert that the vm object lock is held in vm_page_alloc().


119182 20-Aug-2003 bmilekic

In sysctl_vm_zone, do not calculate per-cpu cache stats on
UMA_ZFLAG_INTERNAL zones at all. Apparently, Wilko's alpha
was crashing while entering multi-user because, I think, we
were calculating the garbage cachefree for pcpu caches that
essentially don't exist for at least the 'zones' zone and it so
happened that we were reading from an unmapped location.

Confirmed to fix crash: wilko
Helped debug: wilko, gallatin


119092 18-Aug-2003 phk

Replace a homegrown bdone()/bwait() implementation by the real thing


119059 18-Aug-2003 alc

Three unrelated changes to vm_proc_new(): (1) add vm object locking on the
U pages object; (2) reorganize such that the U pages object is created and
filled in one block; and (3) remove an unnecessary clearing of PG_ZERO.


119045 17-Aug-2003 phk

Use NULL for 3rd argument of VOP_BMAP() rather than custom cast.
Eliminate unused variable.


119004 16-Aug-2003 marcel

In vm_thread_swap{in|out}(), remove the alpha specific conditional
compilation and replace it with a call to cpu_thread_swap{in|out}().
This allows us to add similar code on ia64 without cluttering the
code even more.


118946 15-Aug-2003 phk

Eliminate unnecessary udev_t variable: we can derive it from the dev_t
when we need it.


118945 15-Aug-2003 phk

Make swaponvp() static to the swap_pager.


118931 15-Aug-2003 alc

Extend the scope of the page queues lock in vm_pageout_scan() to cover
the traversal of the PQ_INACTIVE queue.


118878 13-Aug-2003 alc

Remove GIANT_REQUIRED from vmspace_alloc().


118852 13-Aug-2003 alc

Reduce the size of the vm map (and by inclusion the vm space) on 64-bit
architectures by moving a field within the structure.


118848 12-Aug-2003 imp

Expand inline the relevant parts of src/COPYRIGHT for Matt Dillon's
copyrighted files.

Approved by: Matt Dillon


118838 12-Aug-2003 alc

Reduce the size of the vm object on 64-bit architectures by moving
a field within the structure.


118795 11-Aug-2003 bmilekic

- When deciding whether to init the zone with small_init or large_init,
compare the zone element size (+1 for the byte of linkage) against
UMA_SLAB_SIZE - sizeof(struct uma_slab), and not just UMA_SLAB_SIZE.
Add a KASSERT in zone_small_init to make sure that the computed
ipers (items per slab) for the zone is not zero, despite the addition
of the check, just to be sure (this part submitted by: silby)

- UMA_ZONE_VM used to imply BUCKETCACHE. Now it implies
CACHEONLY instead. CACHEONLY is like BUCKETCACHE in the
case of bucket allocations, but in addition to that also ensures that
we don't setup the zone with OFFPAGE slab headers allocated from the
slabzone. This means that we're not allowed to have a UMA_ZONE_VM
zone initialized for large items (zone_large_init) because it would
require the slab headers to be allocated from slabzone, and hence
kmem_map. Some of the zones init'd with UMA_ZONE_VM are so init'd
before kmem_map is suballoc'd from kernel_map, which is why this
change is necessary.
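
The size test in the first item can be sketched as below. The constants are
assumptions chosen for illustration (a 4096-byte slab and a 64-byte in-slab
header), the function names are hypothetical, and whether UMA uses "<" or
"<=" in the comparison is not stated in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLAB_SIZE	4096u	/* assumed value for UMA_SLAB_SIZE */
#define SLAB_HDR_SIZE	64u	/* hypothetical sizeof(struct uma_slab) */

/* Does an item (plus its one byte of linkage) fit beside the header? */
static bool
fits_small_init(size_t item_size)
{
	return (item_size + 1 <= SLAB_SIZE - SLAB_HDR_SIZE);
}

/* Items per slab for a zone that passed the test above. */
static size_t
items_per_slab(size_t item_size)
{
	size_t ipers;

	assert(fits_small_init(item_size));
	ipers = (SLAB_SIZE - SLAB_HDR_SIZE) / (item_size + 1);
	assert(ipers > 0);	/* mirrors the KASSERT described above */
	return (ipers);
}
```

Under these assumed sizes, a 4090-byte item would have passed a comparison
against the full slab size yet left room for zero items once the header is
accounted for; the corrected comparison rejects it.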


118771 11-Aug-2003 bms

Add the mlockall() and munlockall() system calls.
- All those diffs to syscalls.master for each architecture *are*
necessary. This needed clarification; the stub code generation for
mlockall() was disabled, which would prevent applications from
linking to this API (suggested by mux)
- Giant has been quashed. It is no longer held by the code, as
the required locking has been pushed down within vm_map.c.
- Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
to express their intention explicitly.
- Inspected at the vmstat, top and vm pager sysctl stats level.
Paging-in activity is occurring correctly, using a test harness.
- The RES size for a process may appear to be greater than its SIZE.
This is believed to be due to mappings of the same shared library
page being wired twice. Further exploration is needed.
- Believed to back out of allocations and locks correctly
(tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).

PR: kern/43426, standards/54223
Reviewed by: jake, alc
Approved by: jake (mentor)
MFC after: 2 weeks


118764 11-Aug-2003 silby

More pipe changes:

From alc:
Move pageable pipe memory to a separate kernel submap to avoid awkward
vm map interlocking issues. (Bad explanation provided by me.)

From me:
Rework pipespace accounting code to handle this new layout, and adjust
our default values to account for the fact that we now have a solid
limit on allocations.

Also, remove the "maxpipes" limit, as it no longer has a purpose.
(The limit on kva usage solves the problem of having too many pipes.)


118544 06-Aug-2003 phk

Make the first two pages magic to protect the BSD labels rather than
only one.


118537 06-Aug-2003 phk

Remove an unused variable.


118536 06-Aug-2003 phk

Staticize swap_pager_putpages()

Eliminate a lot of checks to make sure requests are not cross-device
which is unnecessary with the new layout. We know a sequential request
cannot possibly be cross-device because there is a reserved page between
the devices.

Remove a couple of comments which no longer are relevant.


118535 06-Aug-2003 phk

Access the swap_pagers' ->putpages() through swappagerops instead
of directly, this is a cleaner way to do it.


118528 06-Aug-2003 phk

Add XXX: comment to vm_pager_unswapped().


118527 06-Aug-2003 phk

Explicitly set B_PAGING


118521 06-Aug-2003 phk

Rip out the totally bogus vnode swapdev_vp with extreme prejudice.

Don't mark buffers with B_KEEPGIANT, we don't drop giant in strategy
at this point in time.


118468 05-Aug-2003 phk

Use sparse struct initialization for struct pagerops.

Mark our buffers B_KEEPGIANT before sending them downstream.

Remove swap_pager_strategy implementation.


118466 05-Aug-2003 phk

Use sparse struct initializations for struct pagerops.

This makes grepping for which pagers implement which methods easier.
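
"Sparse" initialization here means C99 designated initializers. A minimal
sketch with a simplified, hypothetical operations table; the struct name,
the function signatures and the demo_* functions are assumptions and do not
match the real struct pagerops.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical operations table for illustration. */
struct pagerops_sketch {
	void	(*pgo_init)(void);
	int	(*pgo_getpages)(int count);
	int	(*pgo_putpages)(int count);
	int	(*pgo_haspage)(int pindex);
};

static int
demo_getpages(int count)
{
	return (count);
}

static int
demo_haspage(int pindex)
{
	return (pindex >= 0);
}

/*
 * Only the methods this pager implements are named; every member that
 * is not mentioned is implicitly initialized to NULL.
 */
static const struct pagerops_sketch demo_pagerops = {
	.pgo_getpages =	demo_getpages,
	.pgo_haspage =	demo_haspage,
};
```

Because each implemented method appears next to its member name, a search
for ".pgo_putpages" lists exactly the pagers that provide that method.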


118418 04-Aug-2003 phk

Put an uncovered page between the swap devices, that way we can be sure
to not get any cross-device I/O requests. (The unallocated first page
protecting BSD labels already gave us this, but that hack may go away
at some point in time).

Remove the check for cross-device I/O requests in swap_pager_strategy.

Move the repeated statistics updating into flushchainbuf().


118413 04-Aug-2003 alc

Use kmem_alloc_nofault() instead of kmem_alloc_pageable() to allocate
swapbkva. Swapbkva mappings are explicitly managed using pmap_qenter(),
not on-demand by vm_fault(), making kmem_alloc_nofault() more appropriate.

Submitted by: tegge


118398 03-Aug-2003 phk

Name swap_pager_find_dev() more correctly swp_pager_find_dev().

Use ->bio_children to count child buffers, rather than abuse the
bio_caller1 pointer.

Expand the relevant bits of waitchainbuf() inline, this clarifies
the code a little bit.


118392 03-Aug-2003 phk

I accidentally hit undo before committing, fix the resulting off-by-one.


118390 03-Aug-2003 phk

Change the layout policy of the swap_pager from a hardcoded width
striping to a per device round-robin algorithm.

Because of the policy of not attempting to retain previous swap
allocation on page-out, this means that a newly added swap device
almost instantly takes its 1/N share of the I/O load but it takes
somewhat longer for it to assume its 1/N share of the pages if there
is plenty of space on the other devices.

Change the 8G total swapspace limitation to 8G per device instead
by using a per device blist rather than one global blist. This
reduces the memory footprint by 75% (typically a couple hundred
kilobytes) for the common case with one swapdevice but NSWAPDEV=4.

Remove the compile time constant limit of number of swap devices,
there is no limit now. Instead of a fixed size array, store the
per swapdev structure in a TAILQ.

Total swap space is still addressed by a 32 bit page number and
therefore the upper limit is now 2^44 bytes = 16TB (for i386).
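
The 16TB figure can be checked with a line of arithmetic: 2^32 page numbers
times 4 KB (2^12 bytes) per page on i386 gives 2^44 bytes. A small sketch;
the function name swap_addressable_bytes() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable with a 32-bit page number, given the page size as a
 * shift (12 for the 4 KB pages used on i386).
 */
static uint64_t
swap_addressable_bytes(unsigned page_shift)
{
	assert(page_shift < 32);	/* keep the result within 64 bits */
	return ((UINT64_C(1) << 32) << page_shift);
}
```

For page_shift 12 the result is 2^44 bytes, that is, 16 units of 2^40 bytes.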

We still do not allocate the first page of each device in order to
give some amount of protection to any bsdlabel at the start of the
device.

A new device is appended after the existing devices in the swap space,
no attempt is made to fill in holes left behind by swapoff (this can
trivially be changed should it ever become a problem).

The sysctl vm.nswapdev now reflects the number of currently configured
swap devices.

Rename vm_swap_size to swap_pager_avail for consistency with other
exported names.

Change argument type for vm_proc_swapin_all() and swap_pager_isswapped()
to be a struct swdevt pointer rather than an index.

Not changed: we are still using blists to manage the free space,
but since the swapspace is no longer fragmented by the striping
different resource managers might fare better.


118384 03-Aug-2003 phk

Move extern declaration of the various pagerops from vm_pager.c
to vm_pager.h where the various pagers will also see them.


118380 03-Aug-2003 alc

Revise obj_alloc(). Most notably, use the object's lock to prevent two
concurrent invocations from acquiring the same address(es). Also, in case
of an incomplete allocation, free any allocated pages.

In collaboration with: tegge


118369 02-Aug-2003 bmilekic

When INVARIANTS is on and we're in uma_zalloc_free(), we need to make
sure that uma_dbg_free() is called if we're about to call
uma_zfree_internal() but we're asking it to skip the dtor and
uma_dbg_free() call itself. So, if we're about to call
uma_zfree_internal() from uma_zfree_arg() and skip == 1, call
uma_dbg_free() ourselves.


118317 01-Aug-2003 alc

Update the comment at the head of kmem_alloc_nofault() to describe its
purpose and use.


118315 01-Aug-2003 bmilekic

Only free the pcpu cache buckets if they are non-NULL.

Crashed this person's machine: harti
Pointy-hat to: me


118286 31-Jul-2003 phk

Remove unused stuff.

Move used stuff to swap_pager.c where it belongs.

This file no longer exports anything to userland.


118234 31-Jul-2003 peter

Add #include "opt_kstack_pages.h" and "opt_kstack_max_pages.h" to remain
in sync with the backend machdep code. When cpu_thread_init() does not
have the same idea of KSTACK_PAGES as the thing that created the kstack,
all hell breaks loose.

Bad alc! no cookie! :-)


118221 30-Jul-2003 bmilekic

Plug a race and a leak in UMA.

1) The race has to do with zone destruction. From the zone destructor we
would lock the zone, set the working set size to 0, then unlock the zone,
drain it, and then free the structure. Within the window following the
working-set-size set to 0 and unlocking of the zone and the point where
in zone_drain we re-acquire the zone lock, the uma timer routine could
have fired off and changed the working set size to something non-zero,
thereby potentially preventing us from completely freeing slabs before
destroying the zone (and thus leaking them).

2) The leak has to do with zone destruction as well. When destroying a
zone we would take care to free all the buckets cached in the zone, but
although we would drain the pcpu cache buckets, we would not free them.
This resulted in leaking a couple of bucket structures (512 bytes each)
per cpu on SMP during zone destruction.

While I'm here, also silence GCC warnings by turning uma_slab_alloc()
from inline to real function. It's too big to be an inline.

Reviewed by: JeffR


118212 30-Jul-2003 bmilekic

When generating the zone stats make sure to handle the master zone
("UMA Zone") carefully, because it does not have pcpu caches allocated
at all. In the UP case, we did not catch this because one pcpu cache
is always allocated with the zone, but for the MP case, we were getting
bogus stats for this zone.

Tested by: Lukas Ertl <le@univie.ac.at>


118201 30-Jul-2003 phk

Remove the disabling of buckets workaround.

Thanks to: jeffr


118190 30-Jul-2003 jeff

- Get rid of the ill-conceived uz_cachefree member of uma_zone.
- In sysctl_vm_zone use the per cpu locks to read the current cache
statistics; this makes them more accurate while under heavy load.

Submitted by: tegge


118189 30-Jul-2003 jeff

- Check to see if we need a slab prior to allocating one. Failure to do
so not only wastes memory but it can also cause a leak in zones that
will be destroyed later. The problem is that the slab allocation code
places newly created slabs on the partially allocated list because it
assumes that the caller will actually allocate some memory from it.
Failure to do so places an otherwise free slab on the partial slab list
where we won't find it later in zone_drain().

Continuously prodded to fix by: phk (Thanks)


118187 29-Jul-2003 phk

Temporary workaround: Always disable buckets, there is a bug there
somewhere.

JeffR will look at this as soon as he has time.

OK'ed by: jeffr


118104 28-Jul-2003 alc

None of the "alloc" functions used by UMA assume that Giant is held any
longer. (If they still need it, e.g., contigmalloc(), they acquire it
themselves.) Therefore, we need not acquire Giant in slab_zalloc().


118096 27-Jul-2003 alc

Remove GIANT_REQUIRED from kmem_alloc().


118076 27-Jul-2003 mux

Use pmap_zero_page() to zero pages instead of bzero() because
they haven't been vm_map_wire()'d yet.


118074 27-Jul-2003 alc

Allow vm_object_reference() on kernel_object without Giant.


118071 26-Jul-2003 alc

Acquire Giant rather than asserting it is held in contigmalloc(). This is
a prerequisite to removing further uses of Giant from UMA.


118047 26-Jul-2003 phk

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful, in particular to fdescfs and /dev/fd/*

For now pass -1 all over the place.


118040 26-Jul-2003 alc

Gulp ... call kmem_malloc() without Giant.


118029 25-Jul-2003 mux

Add support for the M_ZERO flag to contigmalloc().

Reviewed by: jeff


117903 22-Jul-2003 phk

Remove all but one of the inlines here, this reduces the code size by
2032 bytes and has no measurable impact on performance.


117876 22-Jul-2003 phk

Don't inline very large functions.

Gcc has silently not been doing this for a long time.


117866 22-Jul-2003 peter

swp_pager_hash() was called before it was instantiated inline. This made
gcc (quite rightly) unhappy. Move it earlier.


117747 18-Jul-2003 phk

Fix a printf format warning I introduced.
Use the macro max number of swap devices rather than cache the constant
in a variable.
Avoid a (now) pointless variable.


117736 18-Jul-2003 harti

When INVARIANTS is defined make sure that uma_zalloc_arg (and hence
uma_zalloc) is called with exactly one of either M_WAITOK or M_NOWAIT and
that it is called with neither M_TRYWAIT nor M_DONTWAIT. Print a warning
if anything is wrong. Default to M_WAITOK if no flag is given. This is the
same test as in malloc(9).
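
A minimal user-space sketch of the kind of check described above. The
flag values and the sk_* names are hypothetical placeholders, not the
kernel's definitions, and returning the flags unchanged when both wait
flags are given is an assumption made for illustration only.

```c
#include <stdio.h>

/* Hypothetical flag bits for illustration; not the kernel's M_* values. */
#define SK_NOWAIT	0x0001
#define SK_WAITOK	0x0002
#define SK_TRYWAIT	0x0100	/* stand-in for the mbuf-style flag */
#define SK_DONTWAIT	0x0200	/* stand-in for the mbuf-style flag */

/*
 * Warn unless exactly one of SK_WAITOK / SK_NOWAIT is set and neither
 * mbuf-style flag is present.  Returns the flags to use, defaulting to
 * SK_WAITOK when no wait flag was given.  *warned is set to 1 if any
 * warning was printed, 0 otherwise.
 */
static int
sk_check_wait_flags(int flags, int *warned)
{
	int wait = flags & (SK_WAITOK | SK_NOWAIT);

	*warned = 0;
	if (flags & (SK_TRYWAIT | SK_DONTWAIT)) {
		fprintf(stderr, "warning: mbuf-style wait flag passed\n");
		*warned = 1;
	}
	if (wait == (SK_WAITOK | SK_NOWAIT)) {
		fprintf(stderr, "warning: both wait flags passed\n");
		*warned = 1;
	} else if (wait == 0) {
		fprintf(stderr, "warning: no wait flag, assuming WAITOK\n");
		*warned = 1;
		flags |= SK_WAITOK;
	}
	return (flags);
}
```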


117725 18-Jul-2003 phk

If a proposed swap device exceeds the 8G artificial limit which our
radix-tree code imposes, truncate the device instead of rejecting it.


117724 18-Jul-2003 phk

Move the implementation of the vmspace_swap_count() (used only in
the "toss the largest process" emergency handling) from vm_map.c to
swap_pager.c.

The quantity calculated depends strongly on the internals of the
swap_pager and by moving it, we no longer need to expose the
internal metrics of the swap_pager to the world.


117723 18-Jul-2003 phk

Add a new function swap_pager_status() which reports the total size of the
paging space and how much of it is in use (in pages).

Use this interface from the Linuxolator instead of groping around in the
internals of the swap_pager.


117722 18-Jul-2003 phk

Merge swap_pager.c and vm_swap.c into swap_pager.c, the separation
is not natural and needlessly exposes a lot of dirty laundry.

Move private interfaces between the two from swap_pager.h to swap_pager.c
and staticize as much as possible.

No functional change.


117702 17-Jul-2003 phk

Make sure that SWP_NPAGES always has the same value in all source
files, so that SWAP_META_PAGES does not vary either.

swap_pager.c ended up with a value of 16, everybody else 8. Go with
the 16 for now.

This should only have any effect in the "kill processes because we
are out of swap" scenario, where it will make some sort of estimate
of something more precise.


117519 13-Jul-2003 robert

Avoid an unnecessary calculation: there is no need to subtract
`firstaddr' from `v' if we know that the former equals zero.


117303 07-Jul-2003 alc

- Complete the vm object locking in vm_pageout_object_deactivate_pages().
- Change vm_pageout_object_deactivate_pages()'s first parameter from a
vm_map_t to a pmap_t.
- Change vm_pageout_object_deactivate_pages()'s and
vm_pageout_map_deactivate_pages()'s last parameter from a vm_pindex_t
to a long. Since the number of pages in an address space doesn't
require 64 bits on an i386, vm_pindex_t is overkill.


117262 05-Jul-2003 alc

Lock a vm object when freeing a page from it.


117224 04-Jul-2003 phk

Remove unnecessary cast.


117206 03-Jul-2003 alc

Background: pmap_object_init_pt() premaps the pages of an object in
order to avoid the overhead of later page faults. In general, it
implements two cases: one for vnode-backed objects and one for
device-backed objects. Only the device-backed case is really
machine-dependent, belonging in the pmap.

This commit moves the vnode-backed case into the (relatively) new
function vm_map_pmap_enter(). On amd64 and i386, this commit only
amounts to code rearrangement. On alpha and ia64, the new machine
independent (MI) implementation of the vnode case is smaller and more
efficient than their pmap-based implementations. (The MI
implementation takes advantage of the fact that objects in -CURRENT
are ordered collections of pages.) On sparc64, pmap_object_init_pt()
hadn't (yet) been implemented.


117143 02-Jul-2003 mux

Fix a few style(9) nits.


117094 01-Jul-2003 alc

Modify vm_page_alloc() and vm_page_select_cache() to allow the page that
is returned by vm_page_select_cache() to belong to the object that is
already locked by the caller to vm_page_alloc().


117093 01-Jul-2003 alc

Check the address provided to vm_map_stack() against the vm map's maximum,
returning an error if the address is too high.


117047 29-Jun-2003 alc

Introduce vm_map_pmap_enter(). Presently, this is a stub calling the MD
pmap_object_init_pt().


117045 29-Jun-2003 alc

- Export pmap_enter_quick() to the MI VM. This will permit the
implementation of a largely MI pmap_object_init_pt() for vnode-backed
objects. pmap_enter_quick() is implemented via pmap_enter() on sparc64
and powerpc.
- Correct a mismatch between pmap_object_init_pt()'s prototype and its
various implementations. (I plan to keep pmap_object_init_pt() as
the MD hook for device-backed objects on i386 and amd64.)
- Correct an error in ia64's pmap_enter_quick() and adjust its interface
to match the other versions. Discussed with: marcel


117038 29-Jun-2003 alc

Add vm object locking to vm_pageout_map_deactivate_pages().


117004 28-Jun-2003 alc

Remove GIANT_REQUIRED from kmem_malloc().


117001 28-Jun-2003 alc

- Add vm object locking to vm_pageout_clean().


116959 28-Jun-2003 alc

- Use an int rather than a vm_pindex_t to represent the desired page
color in vm_page_alloc(). (This also has small performance benefits.)
- Eliminate vm_page_select_free(); vm_page_alloc() might as well
call vm_pageq_find() directly.


116923 27-Jun-2003 alc

Simple read-modify-write operations on a vm object's flags, ref_count, and
shadow_count can now rely on its mutex for synchronization. Remove one use
of Giant from vm_map_insert().


116885 26-Jun-2003 alc

vm_page_select_cache() enforces a number of conditions on the returned
page. Add the ability to lock the containing object to those conditions.


116860 26-Jun-2003 alc

Modify vm_pageq_requeue() to handle a PQ_NONE page without dereferencing
a NULL pointer; remove some now unused code.


116837 25-Jun-2003 bmilekic

Move the pcpu lock out of the uma_cache and instead have a single set
of pcpu locks. This makes uma_zone somewhat smaller (by (LOCKNAME_LEN *
sizeof(char) + sizeof(struct mtx) * maxcpu) bytes, to be exact).

No Objections from jeff.


116829 25-Jun-2003 bmilekic

Make sure that the zone destructor doesn't get called twice in
certain free paths.


116799 25-Jun-2003 alc

Remove a GIANT_REQUIRED on the kernel object that we no longer need.


116798 25-Jun-2003 alc

Maintain the lock on a vm object when calling vm_page_grab().


116793 24-Jun-2003 alc

Assert that the vm object is locked on entry to dev_pager_getpages().


116710 23-Jun-2003 alc

Assert that the vm object is locked on entry to vm_pager_get_pages().


116695 22-Jun-2003 alc

Maintain a lock on the vm object of interest throughout vm_fault(),
releasing the lock only if we are about to sleep (e.g., vm_pager_get_pages()
or vm_pager_has_pages()). If we sleep, we have marked the vm object with
the paging-in-progress flag.


116678 22-Jun-2003 phk

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a separate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


116667 22-Jun-2003 alc

As vm_fault() descends the chain of backing objects, set paging-in-
progress on the next object before clearing it on the current object.


116662 22-Jun-2003 alc

Complete the vm object locking in vm_object_backing_scan(); specifically,
deal with the case where we need to sleep on a busy page with two vm object
locks held.


116658 22-Jun-2003 alc

Make some style and white-space changes to the copy-on-write path through
vm_fault(); remove a pointless assignment statement from that path.


116653 21-Jun-2003 phk

Use a do {...} while (0); and a couple of breaks to reduce the level
of indentation a bit.
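
For illustration, a small sketch of the idiom named above, with the
nested form alongside for comparison. The function names and the
conditions tested are hypothetical and unrelated to the file this
commit touched.

```c
#include <stddef.h>

/* Nested form: each precondition adds one level of indentation. */
static int
sk_classify_nested(const int *p, int limit)
{
	int rv = -1;

	if (p != NULL) {
		if (*p >= 0) {
			if (*p < limit) {
				rv = *p;
			}
		}
	}
	return (rv);
}

/*
 * Flattened form: the body runs exactly once, and each failed
 * precondition leaves it early with a break.
 */
static int
sk_classify_flat(const int *p, int limit)
{
	int rv = -1;

	do {
		if (p == NULL)
			break;
		if (*p < 0)
			break;
		if (*p >= limit)
			break;
		rv = *p;
	} while (0);
	return (rv);
}
```

Both functions return the same value for every input; only the shape of
the control flow differs.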


116650 21-Jun-2003 alc

Lock one of the vm objects involved in an optimized copy-on-write fault.


116645 21-Jun-2003 alc

- Increase the scope of the vm object lock in vm_object_collapse().
- Assert that the vm object and its backing vm object are both locked in
vm_object_qcollapse().


116629 20-Jun-2003 alc

Make swap_pager_haspages() static; remove unused function prototypes.


116605 20-Jun-2003 phk

Initialize b_saveaddr when we hand out pbufs


116596 20-Jun-2003 alc

The so-called "optimized copy-on-write fault" case should not require
the vm map lock. What's really needed is vm object locking, which
is (for the moment) provided by Giant.

Reviewed by: tegge


116554 19-Jun-2003 alc

Assert that the vm object is locked in vm_page_try_to_free().


116552 19-Jun-2003 alc

Fix a vm object reference leak in the page-based copy-on-write mechanism
used by the zero-copy sockets implementation.

Reviewed by: gallatin


116512 18-Jun-2003 alc

Lock the vm object when freeing a vm page.


116437 16-Jun-2003 phk

This file was ignored by CVS in my last commit for some reason:

Remove pointless initialization of b_spc field, which now no longer
exists.


116412 15-Jun-2003 phk

Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations
to check that the buffer points to the correct vnode.


116387 15-Jun-2003 alc

Remove an unnecessary forward declaration.


116359 15-Jun-2003 alc

Use #ifdef __alpha__, not __alpha.


116355 14-Jun-2003 alc

Migrate the thread stack management functions from the machine-dependent
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.

Two details:

1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable the guard page, set
KSTACK_GUARD_PAGES to 0.

2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().


116328 14-Jun-2003 alc

Move the *_new_altkstack() and *_dispose_altkstack() functions out of the
various pmap implementations into the machine-independent vm. They were
all identical.


116280 13-Jun-2003 alc

Extend the scope of the vm object lock in swp_pager_async_iodone() to cover
a vm_page_free().


116279 13-Jun-2003 alc

Add vm object locking to various pagers' "get pages" methods, i386 stack
management functions, and a u area management function.


116226 11-Jun-2003 obrien

Use __FBSDID().


116188 11-Jun-2003 peter

GC unused cpu_wait() function


116167 10-Jun-2003 alc

- Finish vm object and page locking in vnode_pager_setsize().
- Make some small style changes to vnode_pager_setsize(); most notably,
move two comments to a more logical place.


116131 09-Jun-2003 phk

Revert last commit, I have no idea what happened.


116117 09-Jun-2003 phk

A white-space nit I noticed.


116080 09-Jun-2003 alc

Hold the vm object's lock when performing vm_page_lookup().


116079 09-Jun-2003 alc

Don't use vm_object_set_flag() to initialize the vm object's flags.


116067 08-Jun-2003 alc

- Properly handle the paging_in_progress case on two vm objects in
vm_object_deallocate().
- Remove vm_object_pip_sleep().


115997 07-Jun-2003 alc

Lock the kernel object in kmem_alloc().


115996 07-Jun-2003 alc

Teach vm_page_grab() how to handle the vm object's lock.


115987 07-Jun-2003 alc

Assert that the vm object is locked on entry to swap_pager_freespace().


115931 07-Jun-2003 alc

Pass the vm object to vm_object_collapse() with its lock held.


115883 05-Jun-2003 phk

Fix NFS file swapping, I broke it 3 months ago it seems.


115879 05-Jun-2003 alc

- Extend the scope of the backing object's lock in vm_object_collapse().


115856 04-Jun-2003 alc

- Add further vm object locking to vm_object_deallocate(), specifically,
for accessing a vm object's shadows.


115853 04-Jun-2003 alc

- Add VM_OBJECT_TRYLOCK().


115818 04-Jun-2003 alc

- Add vm object locking to vm_object_deallocate(). (Still more
changes are required.)
- Remove special-case macros for kmem object locking. They are
no longer used.


115782 03-Jun-2003 alc

Add vm object locking to vm_object_coalesce().


115655 01-Jun-2003 alc

Change kernel_object and kmem_object to (&kernel_object_store) and
(&kmem_object_store), respectively. This allows the address of these
objects to be resolved at link-time rather than run-time.


115523 31-May-2003 phk

Prepend _ to internal union members to avoid ambiguity.

Found by: FlexeLint


115522 31-May-2003 phk

Remove unused variables

Found by: FlexeLint


115516 31-May-2003 alc

Add vm object locking to vm_object_madvise().


115146 19-May-2003 das

If we seem to be out of VM, don't allow the pagedaemon to kill
processes in the first pass. Among other things, this will give
us a chance to launder vnode-backed pages before concluding that
we need more swap. This is particularly useful for systems that
have no swap.

While here, update a comment and remove some long-unused code.

Reported by: Lucky Green <shamrock@cypherpunks.to>
Suggested by: dillon
Approved by: re (rwatson)


115127 18-May-2003 alc

Reduce the size of a vm object by converting its shadow list from a TAILQ
to a LIST.

Approved by: re (rwatson)
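
A rough sketch of where such a saving can come from, assuming the
usual <sys/queue.h> definitions: a LIST head holds a single pointer,
while a TAILQ head also keeps a pointer to the tail. The sk_* names
are hypothetical and stand in for the real structures.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical element type; only the list heads matter here. */
struct sk_node {
	int	sk_dummy;
};

LIST_HEAD(sk_list_head, sk_node);	/* first-element pointer only */
TAILQ_HEAD(sk_tailq_head, sk_node);	/* first-element and tail pointers */

/* Size of a list head embedded in a containing structure. */
static size_t
sk_list_head_size(void)
{
	return (sizeof(struct sk_list_head));
}

/* Size of a tail-queue head embedded in a containing structure. */
static size_t
sk_tailq_head_size(void)
{
	return (sizeof(struct sk_tailq_head));
}
```

The trade-off is that a LIST has no constant-time insertion at the
tail, which only matters if the code needs it.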


114983 13-May-2003 jhb

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


114850 09-May-2003 alc

Give the kmem object's mutex a unique name, instead of "vm object",
to avoid false reports of lock-order reversal with a system map mutex.

Approved by: re (jhb)


114774 06-May-2003 alc

Lock the vm_object when performing vm_pager_deallocate().


114669 04-May-2003 alc

Extend the scope of the vm_object lock in vm_object_terminate().


114649 04-May-2003 alc

Avoid a lock-order reversal and implement vm_object locking
in vm_pageout_page_free().


114599 03-May-2003 alc

Lock the vm_object on entry to vm_object_vndeallocate().


114570 03-May-2003 alc

- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse


114564 03-May-2003 alc

Move a declaration to its proper place.


114489 02-May-2003 alc

Lock the vm_object when updating its shadow list.


114487 02-May-2003 alc

Simplify the removal of a shadow object in vm_object_collapse().


114387 01-May-2003 alc

Extend the scope of the vm_object locking in vm_object_split().


114372 01-May-2003 alc

- Update the vm_object locking in vm_object_reference().
- Convert some dead code in vm_object_reference() into a comment.


114317 30-Apr-2003 alc

Increase the scope of the vm_object lock in vm_map_delete().


114273 30-Apr-2003 alc

Eliminate an unused parameter from vm_pageout_object_deactivate_pages().


114263 30-Apr-2003 alc

Add vm_object locking to vmspace_swap_count().


114245 29-Apr-2003 alc

Remove unused declarations and definitions.


114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


114166 28-Apr-2003 alc

- Lock the vm_object when performing swap_pager_isswapped().
- Assert that the vm_object is locked in swap_pager_isswapped().


114149 28-Apr-2003 alc

uma_zone_set_obj() must perform VM_OBJECT_LOCK_INIT() if the caller
provides storage for the vm_object.


114145 28-Apr-2003 alc

- Define VM_OBJECT_LOCK_INIT().
- Avoid repeatedly mtx_init()ing and mtx_destroy()ing the vm_object's lock
using UMA's uminit callback, in this case, vm_object_zinit().


114128 27-Apr-2003 alc

- Tell witness that holding two or more vm_object locks is okay.
- In vm_object_deallocate(), lock the child when removing the parent
from the child's shadow list.


114112 27-Apr-2003 alc

Various changes to vm_object_shadow(): (1) update the vm_object locking,
(2) remove a pointless assertion, and (3) make a trivial change to a
comment.


114091 26-Apr-2003 alc

Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.


114080 26-Apr-2003 alc

- Lock the vm_object on entry to vm_object_terminate().


114074 26-Apr-2003 alc

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


114053 26-Apr-2003 alc

- Extend the scope of two existing vm_object locks to cover
swap_pager_freespace().


114052 26-Apr-2003 alc

Remove an XXX comment. It is no longer a problem.


114030 25-Apr-2003 jhb

- Don't bother using the proc lock to test just P_SYSTEM as that is set in
fork1() and never changes.
- The proc lock is enough to cover reading p_state, so push down sched_lock
into the PRS_NORMAL case of the switch on p_state.


114019 25-Apr-2003 alc

- Lock the vm_object when iterating over its list of resident pages.


114003 25-Apr-2003 alc

- Relax the Giant required in vm_page_remove().
- Remove the Giant required from vm_page_free_toq(). (Any locking
errors will be caught by vm_page_remove().)

This remedies a panic that occurred when kmem_malloc(NOWAIT) performed
without Giant failed to allocate the necessary pages.

Reported by: phk


113956 24-Apr-2003 alc

- Move swap_pager_isswapped()'s prototype to a more logical place.


113955 24-Apr-2003 alc

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


113918 23-Apr-2003 jhb

Fix compiling in the NO_SWAPPING case.

Submitted by: bde (partially)


113869 22-Apr-2003 jhb

Lock the proc to check p_flag and several other related tests in
vm_daemon(). We don't need to hold sched_lock as long now as a result.


113868 22-Apr-2003 jhb

Prefer the proc lock to sched_lock when testing PS_INMEM now that it is
safe to do so.


113867 22-Apr-2003 jhb

- Always call faultin() in _PHOLD() if PS_INMEM is clear. This closes a
race where a thread could assume that a process was swapped in by
PHOLD() when it actually wasn't fully swapped in yet.
- In faultin(), always msleep() if PS_SWAPPINGIN is set instead of doing
this check after bumping p_lock in the PS_INMEM == 0 case. Also,
sched_lock is only needed for setting and clearing swapping PS_*
flags and the swap thread inhibitor.
- Don't set and clear the thread swap inhibitor in the same loops as the
pmap_swapin/out_thread() since we have to do it under sched_lock.
Instead, mimic the treatment of the PS_INMEM flag and use separate loops
to set the inhibitors when clearing PS_INMEM and clear the inhibitors
when setting PS_INMEM.
- swapout() now returns with the proc lock held as it holds the lock
while adjusting the swapping-related PS_* flags so that the proc lock
can be used to test those flags.
- Only use the proc lock to check the swapping-related PS_* flags in
several places.
- faultin() no longer requires sched_lock to be held by callers.
- Rename PS_SWAPPING to PS_SWAPPINGOUT to be less ambiguous now that we
have PS_SWAPPINGIN.


113856 22-Apr-2003 alc

Revision 1.246 should have also included

- Weaken the assertion in vm_page_insert() to require Giant only if the
vm_object isn't locked.

Reported by: "Ilmar S. Habibulin" <ilmar@watson.org>


113842 22-Apr-2003 alc

Remove unused declarations.


113841 22-Apr-2003 alc

Revision 1.52 of vm/uma_core.c has led to UMA's obj_alloc() being
called without Giant; and obj_alloc() in turn calls vm_page_alloc()
without Giant. This causes an assertion failure in vm_page_alloc().
Fortunately, obj_alloc() is now MPSAFE. So, we need only clean up
some assertions.

- Weaken the assertion in vm_page_lookup() to require Giant only
if the vm_object isn't locked.
- Remove an assertion from vm_page_alloc() that duplicates a check
performed in vm_page_lookup().

In collaboration with: gallatin, jake, jeff


113838 22-Apr-2003 alc

Add VM_OBJECT_LOCKED().


113791 21-Apr-2003 alc

- Assert that the vm_object is locked in vm_object_clear_flag(),
vm_object_pip_add() and vm_object_pip_wakeup().
- Remove GIANT_REQUIRED from vm_object_pip_subtract() and
vm_object_pip_subtract().
- Lock the vm_object when performing vm_object_page_remove().


113775 20-Apr-2003 alc

- Lock the vm_object when performing either vm_object_clear_flag() or
vm_object_pip_wakeup().


113768 20-Apr-2003 alc

- Update the vm_object locking in vm_map_insert().


113765 20-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_wakeup().
- Merge two identical cases in a switch statement.


113761 20-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_wakeup().


113744 20-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_add().
- Remove an unnecessary variable.


113740 20-Apr-2003 alc

Update vm_object locking in vm_map_delete().


113739 20-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_add().


113722 19-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().


113721 19-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().


113701 19-Apr-2003 alc

o Update locking around vm_object_page_remove() in vm_map_clean()
to use the new macros.
o Remove unnecessary increment and decrement of the vm_object's
reference count in vm_map_clean().


113699 19-Apr-2003 alc

Lock the vm_object in obj_alloc().


113671 18-Apr-2003 alc

Update locking around vm_object_page_remove() to use the new macros.


113665 18-Apr-2003 gallatin

Don't grab Giant in slab_zalloc() if M_NOWAIT is specified. This
should allow the use of INTR_MPSAFE network drivers.

Tested by: njl
Glanced at by: jeff


113639 17-Apr-2003 jhb

suser() does not need the proc lock, just the setting of P_PROTECTED in
p_flag needs the lock.


113603 17-Apr-2003 trhodes

Add some tunable descriptions.

Submitted by: hmp
Discussed with: bde


113600 17-Apr-2003 trhodes

Pre-content whitespace commit.

Discussed with: bde


113489 15-Apr-2003 alc

Update locking on the kmem_object to use the new macros.


113458 14-Apr-2003 alc

Update locking on the kernel_object to use the new macros.


113457 13-Apr-2003 alc

Lock some manipulations of the vm object's flags.


113449 13-Apr-2003 alc

Lock some manipulations of the vm object's flags.


113448 13-Apr-2003 alc

Lock some manipulations of the vm object's flags.


113445 13-Apr-2003 alc

Add new macros for locking and unlocking a vm object.


113419 13-Apr-2003 alc

Permit vm_object_pip_add() and vm_object_pip_wakeup() on the kmem_object
without Giant held.


113418 13-Apr-2003 alc

Eliminate unnecessary gotos from kmem_malloc().


113343 10-Apr-2003 jhb

- Kill the pv_flags member of the alpha mdpage since it stopped being used
in rev 1.61 of pmap.c.
- Now that pmap_page_is_free() is empty and since it is just a hack for
the Alpha pmap, remove it.


113138 05-Apr-2003 alc

Remove GIANT_REQUIRED from getpbuf(). Reviewed by: tegge

Reduce pbuf_mtx's scope in relpbuf(). Submitted by: tegge


113070 04-Apr-2003 des

Rename a static variable to avoid future conflicts.


112881 31-Mar-2003 wes

Add a facility allowing processes to inform the VM subsystem they are
critical and should not be killed when pageout is looking for more
memory pages in all the wrong places.

Reviewed by: arch@
Sponsored by: St. Bernard Software


112835 30-Mar-2003 mux

The object type can't be OBJT_PHYS in vm_mmap().

Reviewed by: peter


112683 26-Mar-2003 tegge

Obtain Giant before calling kmem_alloc without M_NOWAIT and before calling
kmem_free if Giant isn't already held.


112569 25-Mar-2003 jake

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses are larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


112390 19-Mar-2003 mux

Remove an empty comment.


112367 18-Mar-2003 phk

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a nested include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


112329 17-Mar-2003 jake

Subtract the memory that backs the vm_page structures from phys_avail
after mapping it. This makes it possible to determine if a physical
page has a backing vm_page or not.


112312 16-Mar-2003 jake

Made the prototypes for pmap_kenter and pmap_kremove MD. These functions
are machine dependent because they are not required to update the tlb when
mappings are added or removed, and doing so is machine dependent.
In addition, an implementation may require that pages mapped with pmap_kenter
have a backing vm_page_t, which is not necessarily true of all physical
pages, and so may choose to pass the vm_page_t to pmap_kenter instead of the
physical address in order to make this requirement clear.


112167 12-Mar-2003 das

- When the VM daemon is out of swap space and looking for a
process to kill, don't block on a map lock while holding the
process lock. Instead, skip processes whose map locks are held
and find something else to kill.
- Add vm_map_trylock_read() to support the above.

Reviewed by: alc, mike (mentor)
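
The skip-instead-of-block pattern described above can be sketched in
user space as follows. This is not the kernel code: the sk_* names, the
atomic_flag standing in for a map lock, and the choice of the largest
candidate as the victim are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical process record; map_busy stands in for the map lock. */
struct sk_proc {
	atomic_flag	map_busy;	/* set while the "map lock" is held */
	long		size;		/* hypothetical memory footprint */
};

/* Non-blocking acquire: returns 1 on success, 0 if already held. */
static int
sk_map_trylock(struct sk_proc *p)
{
	return (!atomic_flag_test_and_set(&p->map_busy));
}

static void
sk_map_unlock(struct sk_proc *p)
{
	atomic_flag_clear(&p->map_busy);
}

/*
 * Pick the largest process whose map lock can be taken without
 * waiting.  Processes whose lock is busy are skipped, not waited on.
 */
static struct sk_proc *
sk_pick_victim(struct sk_proc *procs, size_t n)
{
	struct sk_proc *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct sk_proc *p = &procs[i];

		if (!sk_map_trylock(p))
			continue;	/* busy: move on to the next one */
		if (best == NULL || p->size > best->size)
			best = p;
		sk_map_unlock(p);
	}
	return (best);
}
```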


111977 08-Mar-2003 ken

Zero copy send and receive fixes:

- On receive, vm_map_lookup() needs to trigger the creation of a shadow
object. To make that happen, call vm_map_lookup() with PROT_WRITE
instead of PROT_READ in vm_pgmoveco().

- On send, a shadow object will be created by the vm_map_lookup() in
vm_fault(), but vm_page_cowfault() will delete the original page from
the backing object rather than simply letting the legacy COW mechanism
take over. In other words, the new page should be added to the shadow
object rather than replacing the old page in the backing object. (i.e.
vm_page_cowfault() should not be called in this case.) We accomplish
this by making sure fs.object == fs.first_object before calling
vm_page_cowfault() in vm_fault().

Submitted by: gallatin, alc
Tested by: ken


111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


111936 05-Mar-2003 rwatson

Provide a mac_check_system_swapoff() entry point, which permits MAC
modules to authorize disabling of swap against a particular vnode.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


111883 04-Mar-2003 jhb

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


111732 02-Mar-2003 phk

NO_GEOM cleanup:

Use VOP_IOCTL(DIOCGMEDIASIZE) to check the size of a potential swap device
instead of the cdevsw->d_psize() method.


111712 01-Mar-2003 alc

Teach vm_page_sleep_if_busy() to release the vm_object lock before sleeping.


111467 25-Feb-2003 alc

Fuse two #ifdefs with identical conditions.


111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


111462 25-Feb-2003 mux

Cleanup of the d_mmap_t interface.

- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.

I'm still not sure whether we need a __FreeBSD_version bump for this,
since we didn't guarantee API/ABI stability until 5.1-RELEASE.

Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
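
A minimal sketch of the return convention described above, not the actual
d_mmap_t signature: the type sk_paddr_t, the device window constants, and
both function names are hypothetical. It shows why an errno value returned
under the old "-1 means failure" rule would have been mistaken for an
address, and how the new rule (0 on success, address through a pointer)
avoids that.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t sk_paddr_t;        /* stand-in for the kernel address type */

#define SK_DEV_BASE ((sk_paddr_t)0x10000000u)  /* hypothetical device window */
#define SK_DEV_SIZE ((sk_paddr_t)0x4000u)

/* Old convention: only -1 was recognized as failure. */
int sk_old_result_failed(int result)
{
    return result == -1;
}

/* New convention: 0 means success and *paddr is valid; nonzero is an error. */
int sk_mmap_new(uintptr_t offset, sk_paddr_t *paddr)
{
    assert(paddr != NULL);
    if (offset >= SK_DEV_SIZE)
        return EINVAL;               /* *paddr is left untouched on error */
    *paddr = SK_DEV_BASE + offset;
    return 0;
}
```

Under the old rule, sk_old_result_failed(EINVAL) is false, so a handler that
returned EINVAL would have had that value treated as a mapping result.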


111434 24-Feb-2003 alc

In vm_page_dirty(), assert that the page is not in the free queue(s).


111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


110983 16-Feb-2003 alc

Remove GIANT_REQUIRED from vm_pageq_remove().


110958 15-Feb-2003 alc

Remove the acquisition and release of Giant around pmap_growkernel().
It's unnecessary for two reasons: (1) Giant is at present already held in
such cases and (2) our various implementations of pmap_growkernel() look to
be MP safe. (For example, for sparc64 the proof of (2) is trivial.)


110957 15-Feb-2003 alc

Move kernel_vm_end's declaration to pmap.h; add a comment regarding the
synchronization of access to kernel_vm_end.


110597 09-Feb-2003 alc

Add a comment describing how pagedaemon_wakeup() should be used and
synchronized.

Suggested by: tegge


110313 04-Feb-2003 phk

Change a printf to also tell how many items were left in the zone.


110225 02-Feb-2003 alc

- It's more accurate to say that vm_paging_needed() returns TRUE
than a positive number.
- In pagedaemon_wakeup(), set vm_pages_needed to 1 rather than
incrementing it to accomplish the same.


110218 02-Feb-2003 alc

- Convert vm_pageout()'s tsleep()s to msleep()s with the page queue lock.


110207 01-Feb-2003 alc

- Remove (some) unnecessary explicit initializations to zero.
- Style changes to vm_pageout(): declarations and white-space.


110204 01-Feb-2003 alc

- Convert the tsleep()s in vm_wait() and vm_waitpfault() to msleep()s
with the page queue lock.
- Assert that the page queue lock is held in vm_page_free_wakeup().


109912 27-Jan-2003 alc

Simplify vm_object_page_remove(): The object's memq is now ordered. The
two cases that existed before for performance optimization purposes can
be reduced to one.


109820 25-Jan-2003 alc

Add MTX_DUPOK to the initialization of system map locks.


109630 21-Jan-2003 alfred

use 'void *' instead of 'caddr_t' for useracc, kernacc, vslock and vsunlock.


109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


109587 20-Jan-2003 dillon

Fix swapping to a file, it was broken when SPECSTRATEGY was introduced.


109572 20-Jan-2003 dillon

Close the remaining user address mapping races for physical
I/O, CAM, and AIO. Still TODO: streamline useracc() checks.

Reviewed by: alc, tegge
MFC after: 7 days


109554 20-Jan-2003 alc

- Hold the page queues lock around vm_page_hold().
- Assert that the page queues lock rather than Giant is held in
vm_page_hold().


109548 20-Jan-2003 jeff

- M_WAITOK is 0 and not a real flag. Test for this properly.

Submitted by: tmm
Pointy hat to: jeff
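
To illustrate the kind of mistake this fixes, here is a small sketch. The
flag values and function names are assumptions chosen for the example; only
the fact that M_WAITOK is 0 comes from the log entry. Masking with a
zero-valued flag can never succeed, so the test has to be phrased in terms
of the flag that does have a bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values: M_WAITOK is 0 per the log; M_NOWAIT's bit is assumed. */
#define M_WAITOK 0x0000
#define M_NOWAIT 0x0001

/* Buggy: (flags & 0) is always 0, so this never reports "may sleep". */
bool may_sleep_buggy(int flags)
{
    return (flags & M_WAITOK) != 0;
}

/* Correct: the caller may sleep exactly when M_NOWAIT is absent. */
bool may_sleep(int flags)
{
    return (flags & M_NOWAIT) == 0;
}
```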


109496 18-Jan-2003 obrien

Rev 1.16 renamed VM_METER to VM_TOTAL. This is breaking 3rd-party apps.
So add a VM_METER compat define.

Submitted by: Andy Fawcett <andy@athame.co.uk>


109342 16-Jan-2003 dillon

Merge all the various copies of vm_fault_quick() into a single
portable copy.


109223 14-Jan-2003 alc

- Update vm_pageout_deficit using atomic operations. It's a simple
counter outside the scope of existing locks.
- Eliminate a redundant clearing of vm_pageout_deficit.
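
A rough userland sketch of a lock-free counter of this kind. It uses C11
<stdatomic.h> as a stand-in for the kernel's own atomic primitives, and the
names deficit_add() and deficit_take() are hypothetical. The read-and-reset
step is an assumption about how such a counter might be consumed, not a
description of the committed code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for vm_pageout_deficit; zero-initialized at program start. */
static atomic_int pageout_deficit;

/* Producers add to the counter without taking any lock. */
void deficit_add(int npages)
{
    atomic_fetch_add(&pageout_deficit, npages);
}

/* Consumer reads and clears in one step, so no concurrent add is lost. */
int deficit_take(void)
{
    return atomic_exchange(&pageout_deficit, 0);
}
```

Because the exchange both reads and clears, a separate store of zero
afterwards would be redundant.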


109216 14-Jan-2003 alc

Make vm_pageout_page_free() static.


109205 13-Jan-2003 dillon

It is possible for an active aio to prevent shared memory from being
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.

Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().

MFC after: 7 days


109198 13-Jan-2003 phk

We can get past here on a normal vnode as well, so use VOP_STRATEGY if so.


109153 13-Jan-2003 dillon

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


109151 12-Jan-2003 alc

Make vm_page_alloc() return PG_ZERO only if VM_ALLOC_ZERO is specified.
The objective being to eliminate some cases of page queues locking.
(See, for example, vm/vm_fault.c revision 1.160.)

Reviewed by: tegge

(Also, pointed out by tegge that I changed vm_fault.c before changing
vm_page.c. Oops.)
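
The rule can be sketched as follows. The SK_ flag values, struct sk_page,
and sk_page_alloc_finish() are hypothetical; the sketch only shows the
contract that a page is reported as zeroed when it is pre-zeroed and the
caller asked for that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for illustration; the kernel's differ. */
#define SK_VM_ALLOC_ZERO 0x01   /* request: caller would like a zeroed page */
#define SK_PG_ZERO       0x01   /* page flag: contents are known to be zero */

struct sk_page {
    int flags;
};

/*
 * Finish an allocation. Report SK_PG_ZERO only when the page really is
 * pre-zeroed and the caller asked for that information.
 */
void sk_page_alloc_finish(struct sk_page *m, int req, bool prezeroed)
{
    assert(m != NULL);
    m->flags = 0;
    if (prezeroed && (req & SK_VM_ALLOC_ZERO) != 0)
        m->flags |= SK_PG_ZERO;
}
```

A caller that never passes the request flag never sees the page flag, so it
has nothing to clear afterwards.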


109131 12-Jan-2003 alc

vm_fault_copy_entry() needn't clear PG_ZERO because it didn't pass
VM_ALLOC_ZERO to vm_page_alloc().


109123 12-Jan-2003 dillon

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


109114 11-Jan-2003 alc

In vm_page_alloc(), fuse two if statements that are conditioned on the same
expression.


109097 11-Jan-2003 dillon

Make 'sysctl vm.vmtotal' work properly using updated patch from Hiten.
(the patch in the PR was stale).

PR: kern/5689
Submitted by: Hiten Pandya <hiten@unixdaemons.com>


108963 08-Jan-2003 alc

In vm_page_alloc(), honor VM_ALLOC_ZERO for system and interrupt class
requests when the number of free pages is below the reserved threshold.
Previously, VM_ALLOC_ZERO was only honored when the number of free pages
was above the reserved threshold. Honoring it in all cases generally
makes sense, does no harm, and simplifies the code.


108723 05-Jan-2003 phk

Convert VOP_STRATEGY to VOP_SPECSTRATEGY in the generic getpages and
the pager input for small filesystems.


108693 05-Jan-2003 alc

Use atomic add and subtract to update the global wired page count,
cnt.v_wire_count.


108686 04-Jan-2003 phk

Temporarily introduce a new VOP_SPECSTRATEGY operation while I try
to sort out disk-io from file-io in the vm/buffer/filesystem space.

The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.

Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.

Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.

Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.

Apart from up to two instances of console messages per boot, this amounts
to a glorified no-op commit.

If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org


108677 04-Jan-2003 alc

Allow kmem_malloc() without Giant if M_NOWAIT is specified.


108676 04-Jan-2003 alc

Use vm_object_lock() and vm_object_unlock() in vm_object_deallocate().
(This procedure needs further work, but this change is sufficient for
locking the kmem_object.)


108675 04-Jan-2003 alc

Refine the assertions in vm_page_alloc().


108610 03-Jan-2003 alc

Refine the assertion in vm_object_clear_flag() to allow operation on the
kmem_object without Giant. In that case, assert that the kmem_object's
mutex is held.


108609 03-Jan-2003 phk

Revert use of dmmax_mask, I had overlooked a '~'.

Spotted by: bde


108602 03-Jan-2003 phk

Make struct swblock kernel only, to make vm/swap_pager.h userland includable.
Move struct swdevt from sys/conf.h to the more appropriate vm/swap_pager.h.
Adjust #include use in libkvm and pstat(8) to match.


108600 03-Jan-2003 phk

Avoid extern decls in .c files by putting them in the vm/swap_pager.h
include file where they belong.
Share the dmmax_mask variable.


108599 03-Jan-2003 phk

Use correct _VM_SWAP_PAGER_H_ to check for multiple inclusion.


108595 03-Jan-2003 phk

Retire sys/dmap.h by including the two lines of it which matter
directly in vm/vm_swap.c.


108594 03-Jan-2003 alc

Lock the vm object when performing vm_object_clear_flag().


108589 03-Jan-2003 phk

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


108585 03-Jan-2003 alc

Add vm map and vm object locking to vmtotal().


108551 02-Jan-2003 alc

Lock the vm object when performing vm_object_clear_flag().


108534 01-Jan-2003 alc

Update the assertions in vm_page_insert() and vm_page_lookup() to reflect
locking of the kmem_object.


108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


108518 01-Jan-2003 alc

Add a needed #include.

Reported by: ia64 tinderbox


108515 31-Dec-2002 alc

Implement a variant locking scheme for vm maps: Access to system maps
is now synchronized by a mutex, whereas access to user maps is still
synchronized by a lockmgr()-based lock. Why? No single type of lock,
including sx locks, meets the requirements of both types of vm map.
Sometimes we sleep while holding the lock on a user map. Thus, a
mutex isn't appropriate. On the other hand, both lockmgr()-based
and sx locks release Giant when a thread/process blocks during
contention for a lock. This could lead to a race condition in a legacy
driver (that relies on Giant for synchronization) if it attempts to
kmem_malloc() and fails to immediately obtain the lock. Fortunately,
we never sleep while holding a system map lock.


108426 30-Dec-2002 alc

- Mark the kernel_map as a system map immediately after its creation.
- Correct a cast.


108418 30-Dec-2002 alc

- Increment the vm_map's timestamp if _vm_map_trylock() succeeds.
- Introduce map_sleep_mtx and use it to replace Giant in
vm_map_unlock_and_wait() and vm_map_wakeup(). (Original
version by: tegge.)


108413 29-Dec-2002 alc

- Remove vm_object_init2(). It is unused.
- Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero()
to migrate from _vm_object_allocate() to vm_object_zinit(), where it
will be performed less often.)


108384 29-Dec-2002 alc

Reduce the number of times that we acquire and release the page queues
lock by making vm_page_rename()'s caller, rather than vm_page_rename(),
responsible for acquiring it.


108370 28-Dec-2002 alc

Assert that the page queues lock rather than Giant is held in
vm_page_flag_clear().


108361 28-Dec-2002 dillon

vm_pager_put_pages() takes VM_PAGER_* flags, not OBJPC_* flags. It just
so happens that OBJPC_SYNC has the same value as VM_PAGER_PUT_SYNC so no
harm done. But fix it :-)

No operational changes.

MFC after: 1 day


108358 28-Dec-2002 dillon

Allow the VM object flushing code to cluster. When the filesystem syncer
comes along and flushes a file which has been mmap()'d SHARED/RW, with
dirty pages, it was flushing the underlying VM object asynchronously,
resulting in thousands of 8K writes. With this change the VM Object flushing
code will cluster dirty pages in 64K blocks.

Note that until the low memory deadlock issue is reviewed, it is not safe
to allow the pageout daemon to use this feature. Forced pageouts still
use fs block size'd ops for the moment.

MFC after: 3 days
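
The effect of clustering can be sketched with a small counting function.
This is not the object flushing code: count_writes() is a hypothetical
helper that, given sorted dirty page indices, counts how many write
operations are needed when up to max_run contiguous pages may go out in one
operation. With 8K units and a 64K cluster, max_run would be 8; that mapping
is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count I/O operations needed to flush the sorted page indices in
 * pages[0..n-1] when at most max_run contiguous pages may be written
 * per operation.
 */
size_t count_writes(const unsigned long *pages, size_t n, size_t max_run)
{
    size_t writes = 0;
    size_t i = 0;

    while (i < n) {
        size_t run = 1;

        /* Extend the run while the next index is adjacent and room remains. */
        while (i + run < n && run < max_run &&
            pages[i + run] == pages[i] + run)
            run++;
        writes++;
        i += run;
    }
    return writes;
}
```

For the indices 0 through 9 plus 20, a max_run of 1 needs 11 writes, while a
max_run of 8 needs 3 (pages 0-7, pages 8-9, and page 20).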


108351 28-Dec-2002 alc

Two changes to kmem_malloc():
- Use VM_ALLOC_WIRED.
- Perform vm_page_wakeup() after pmap_enter(), like we do everywhere else.


108334 27-Dec-2002 alc

- Change vm_object_page_collect_flush() to assert rather than
acquire the page queues lock.
- Acquire the page queues lock in vm_object_page_clean().


108306 27-Dec-2002 alc

Increase the scope of the page queues lock in phys_pager_getpages().


108262 24-Dec-2002 alc

- Hold the page queues lock around calls to vm_page_flag_clear().


108251 24-Dec-2002 alc

- Hold the page queues lock around vm_page_wakeup().


108233 23-Dec-2002 alc

- Hold the kernel_object's lock around vm_page_insert(..., kernel_object,
...).


108197 23-Dec-2002 alc

Eliminate some dead code. (Any possible use for this code died with
vm/vm_page.c revision 1.220.)

Submitted by: bde


108171 22-Dec-2002 dillon

The UP -current was not properly counting the per-cpu VM stats in the
sysctl code. This makes 'systat -vm 1's syscall count work again.

Submitted by: Michal Mertl <mime@traveller.cz>
Note: also slated for 5.0


108138 20-Dec-2002 alc

Increase the scope of the kmem_object locking in kmem_malloc().


108117 20-Dec-2002 alc

Add a mutex to struct vm_object. Initialize and destroy that mutex
at appropriate times. For the moment, the mutex is only used on
the kmem_object.


108101 19-Dec-2002 alc

Remove the hash_rand field from struct vm_object. As of revision 1.215 of
vm/vm_page.c, it is unused.


108081 19-Dec-2002 alc

- Remove vm_page_sleep_busy(). The transition to vm_page_sleep_if_busy(),
which incorporates page queue and field locking, is complete.
- Assert that the page queue lock rather than Giant is held in
vm_page_flag_set().


108068 19-Dec-2002 alc

- Hold the page queues lock when performing vm_page_busy() or
vm_page_flag_set().
- Replace vm_page_sleep_busy() with proper page queues locking
and vm_page_sleep_if_busy().


108012 18-Dec-2002 alc

- Hold the page queues lock when performing vm_page_busy().
- Replace vm_page_sleep_busy() with proper page queues locking
and vm_page_sleep_if_busy().


108011 18-Dec-2002 alc

Hold the page queues lock when performing vm_page_flag_set().


107989 17-Dec-2002 alc

Hold the page queues lock when performing vm_page_flag_set().


107948 16-Dec-2002 dillon

Change the way ELF coredumps are handled. Instead of unconditionally
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.

Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.

This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.

PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days


107918 15-Dec-2002 alc

Perform vm_object_lock() and vm_object_unlock() on kmem_object
around vm_page_lookup() and vm_page_free().


107913 15-Dec-2002 dillon

This is David Schultz's swapoff code which I am finally able to commit.
This should be considered highly experimental for the moment.

Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 3 weeks


107912 15-Dec-2002 dillon

Fix a refcount race with the vmspace structure. In order to prevent
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.

This commit solves the problem by introducing a secondary reference count,
called 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.

MFC after: 3 weeks
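
A minimal sketch of the two-counter scheme, assuming a single-threaded
model with no locking. struct vmspace_sketch and the vs_*() functions are
hypothetical stand-ins; only the roles of the two counters follow the
description above.

```c
#include <assert.h>
#include <stdbool.h>

struct vmspace_sketch {
    int  refcnt;      /* ordinary references (role of vm_refcnt) */
    int  exitingcnt;  /* exited but not yet reaped (role of vm_exitingcnt) */
    bool freed;       /* set when the final teardown would run */
};

/* Free only when both counters have dropped to zero. */
static void vs_maybe_free(struct vmspace_sketch *vm)
{
    if (vm->refcnt == 0 && vm->exitingcnt == 0) {
        assert(!vm->freed);          /* a double free would trip here */
        vm->freed = true;
    }
}

void vs_ref(struct vmspace_sketch *vm)
{
    vm->refcnt++;
}

void vs_unref(struct vmspace_sketch *vm)
{
    vm->refcnt--;
    vs_maybe_free(vm);
}

/* Process exit: move the reference from refcnt to exitingcnt. */
void vs_exit(struct vmspace_sketch *vm)
{
    vm->exitingcnt++;
    vm->refcnt--;
}

/* Zombie reaped: drop the exiting reference, possibly freeing. */
void vs_reap(struct vmspace_sketch *vm)
{
    vm->exitingcnt--;
    vs_maybe_free(vm);
}
```

A temporary reference taken and dropped between exit and reap (the swapout
case mentioned above) brings refcnt back to 0 but cannot free the structure,
because exitingcnt is still nonzero.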


107893 15-Dec-2002 alc

As per the comments, vm_object_page_remove() now expects its caller to lock
the object (i.e., acquire Giant).


107892 15-Dec-2002 alc

Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().


107891 15-Dec-2002 alc

Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().


107887 15-Dec-2002 alc

Assert that the page queues lock is held in vm_page_unhold(),
vm_page_remove(), and vm_page_free_toq().


107464 01-Dec-2002 alc

Hold the page queues lock when calling pmap_protect(); it updates fields
of the vm_page structure. Make the style of the pmap_protect() calls
consistent.

Approved by: re (blanket)


107436 01-Dec-2002 alc

Hold the page queues lock when calling pmap_protect(); it updates fields
of the vm_page structure. Nearby, remove an unnecessary semicolon and
return statement.

Approved by: re (blanket)


107433 01-Dec-2002 alc

Increase the scope of the page queue lock in vm_pageout_scan().

Approved by: re (blanket)


107370 28-Nov-2002 alc

Lock page field accesses in mincore().

Approved by: re (blanket)


107347 27-Nov-2002 alc

Hold the page queues lock when performing pmap_clear_modify().

Approved by: re (blanket)


107304 27-Nov-2002 alc

Hold the page queues lock while performing pmap_page_protect().

Approved by: re (blanket)


107250 25-Nov-2002 alc

Acquire and release the page queues lock around calls to pmap_protect()
because it updates flags within the vm page.

Approved by: re (blanket)


107200 24-Nov-2002 alc

Extend the scope of the page queues/fields locking in vm_freeze_copyopts()
to cover pmap_remove_all().

Approved by: re


107189 23-Nov-2002 alc

Hold the page queues/flags lock when calling vm_page_set_validclean().

Approved by: re


107185 23-Nov-2002 alc

Assert that the page queues lock rather than Giant is held in
vm_pageout_page_free().

Approved by: re


107182 23-Nov-2002 alc

Add page queue and flag locking in vnode_pager_setsize().

Approved by: re


107136 21-Nov-2002 jeff

- Add an event that is triggered when the system is low on memory. This is
intended to be used by significant memory consumers so that they may drain
some of their caches.

Inspired by: phk
Approved by: re
Tested on: x86, alpha
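
The idea can be sketched as a small callback registry. This is not the
kernel's event handler interface: every name below (sk_lowmem_register(),
sk_lowmem_fire(), struct sk_cache) is hypothetical, and the fixed-size table
is a simplification.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*sk_lowmem_fn)(void *arg);

#define SK_MAX_HANDLERS 8

static struct {
    sk_lowmem_fn fn;
    void *arg;
} sk_handlers[SK_MAX_HANDLERS];
static size_t sk_nhandlers;

/* Register a callback; returns 0 on success, -1 if the table is full. */
int sk_lowmem_register(sk_lowmem_fn fn, void *arg)
{
    if (sk_nhandlers == SK_MAX_HANDLERS)
        return -1;
    sk_handlers[sk_nhandlers].fn = fn;
    sk_handlers[sk_nhandlers].arg = arg;
    sk_nhandlers++;
    return 0;
}

/* Called when memory is low: give every consumer a chance to shrink. */
void sk_lowmem_fire(void)
{
    for (size_t i = 0; i < sk_nhandlers; i++)
        sk_handlers[i].fn(sk_handlers[i].arg);
}

/* Example consumer: a cache that drops all of its entries on request. */
struct sk_cache {
    size_t nentries;
};

void sk_cache_drain(void *arg)
{
    struct sk_cache *c = arg;

    c->nentries = 0;
}
```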


107048 18-Nov-2002 jeff

- Wakeup the correct address when a zone is no longer full.

Spotted by: jake


107039 18-Nov-2002 alc

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


106992 16-Nov-2002 jeff

- Don't forget the flags value when using boot pages.

Reported by: grehan


106981 16-Nov-2002 alc

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


106871 13-Nov-2002 alc

Remove dead code that hasn't been needed since the demise of share maps
in various revisions of vm/vm_map.c between 1.148 and 1.153.


106838 13-Nov-2002 alc

Move pmap_collect() out of the machine-dependent code, rename it
to reflect its new location, and add page queue and flag locking.

Notes: (1) alpha, i386, and ia64 had identical implementations
of pmap_collect() in terms of machine-independent interfaces;
(2) sparc64 doesn't require it; (3) powerpc had it as a TODO.


106778 11-Nov-2002 cognet

Remove extra #include<sys/vmmeter.h>.


106773 11-Nov-2002 mjacob

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


106753 11-Nov-2002 alc

- Clear the page's PG_WRITEABLE flag in the i386's pmap_changebit()
if we're removing write access from the page's PTEs.
- Export pmap_remove_all() on alpha, i386, and ia64. (It's already
exported on sparc64.)


106733 10-Nov-2002 mjacob

Use atomic_set_8 on the us_freelist maps as they are not otherwise
protected. Furthermore, in some RISC architectures with no normal
byte operations, the surrounding 3 bytes are also affected by the
read-modify-write that has to occur.


106720 10-Nov-2002 alc

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


106708 09-Nov-2002 alc

Fix an error case in vm_map_wire(): unwiring of an entry during cleanup
after a user wire error fails when the entry is already system wired.

Reported by: tegge


106691 09-Nov-2002 alc

In vm_page_remove(), avoid calling vm_page_splay() if the object's memq
is empty.


106605 07-Nov-2002 tmm

Move the definitions of the hw.physmem, hw.usermem and hw.availpages
sysctls to MI code; this reduces code duplication and makes all of them
available on sparc64, and the latter two on powerpc.
The semantics of the i386 and pc98 hw.availpages are slightly changed:
previously, holes between ranges of available pages would be included,
while they are excluded now. The new behaviour should be more correct
and brings i386 in line with the other architectures.

Move physmem to vm/vm_init.c, where this variable is used in MI code.


106603 07-Nov-2002 mux

Better printf() formats.


106602 07-Nov-2002 mux

Some more printf() format fixes.


106600 07-Nov-2002 mux

Correctly print vm_offset_t types.


106422 04-Nov-2002 alc

Export the function vm_page_splay().


106387 03-Nov-2002 alc

- Remove the memory allocation for the object/offset hash table
because it's no longer used. (See revision 1.215.)
- Fix a harmless bug: the number of vm_page structures allocated wasn't
properly adjusted when uma_bootstrap() was introduced. Consequently,
we were allocating 30 unused vm_page structures.
- Wrap a long line.


106359 02-Nov-2002 alc

Remove the vm page buckets mutex. As of revision 1.215 of vm/vm_page.c,
it is unused.


106277 01-Nov-2002 jeff

- Add support for machine-dependent page allocation routines. MD code
may define UMA_MD_SMALL_ALLOC to make use of this feature.

Reviewed by: peter, jake


106276 01-Nov-2002 jeff

- Add a new flag to vm_page_alloc, VM_ALLOC_NOOBJ. This tells
vm_page_alloc not to insert this page into an object. The pindex is
still used for colorization.
- Rework vm_page_select_* to accept a color instead of an object and
pindex to work with VM_ALLOC_NOOBJ.
- Document other VM_ALLOC_ flags.

Reviewed by: peter, jake


106023 27-Oct-2002 rwatson

Merge from MAC tree: rename mac_check_vnode_swapon() to
mac_check_system_swapon(), to reflect the fact that the primary
object of this change is the running kernel as a whole, rather
than just the vnode. We'll drop additional checks of this
class into the same check namespace, including reboot(),
sysctl(), et al.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105853 24-Oct-2002 jeff

- Now that uma_zalloc_internal is not the fast path don't be so fussy about
extra function calls. Refactor uma_zalloc_internal into separate functions
for finding the most appropriate slab, filling buckets, allocating single
items, and pulling items off of slabs. This makes the code significantly
cleaner.
- This also fixes the "Returning an empty bucket." panic that a few people
have seen.

Tested On: alpha, x86


105848 24-Oct-2002 jeff

- Move the destructor calls so that they are not called with the zone lock
held. This avoids a lock order reversal when destroying zones.
Unfortunately, this also means that the free checks are not done before
the destructor is called.

Reported by: phk


105718 22-Oct-2002 rwatson

Invoke mac_check_vnode_mmap() during mmap operations on vnodes,
permitting policies to restrict access to memory mapping based on
the credential requesting the mapping, the target vnode, the
requested rights, or other policy considerations.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105717 22-Oct-2002 rwatson

Introduce MAC_CHECK_VNODE_SWAPON, which permits MAC policies to
perform authorization checks during swapon() events; policies
might choose to enforce protections based on the credential
requesting the swap configuration, the target of the swap operation,
or other factors such as internal policy state.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105695 22-Oct-2002 jhb

- Check that a process isn't a new process (p_state == PRS_NEW) before
trying to acquire its proc lock since the proc lock may not have been
constructed yet.
- Split up the one big comment at the top of the loop and put the pieces
in the right order above the various checks.

Reported by: kris (1)


105689 22-Oct-2002 sheldonh

Fix typo in comments (misspelled "necessary").


105549 20-Oct-2002 alc

o Reinline vm_page_undirty(), reducing the kernel size. (This reverts
a part of vm_page.h revision 1.87 and vm_page.c revision 1.167.)


105466 19-Oct-2002 alc

Complete the page queues locking needed for the page-based copy-
on-write (COW) mechanism. (This mechanism is used by the zero-copy
TCP/IP implementation.)
- Extend the scope of the page queues lock in vm_fault()
to cover vm_page_cowfault().
- Modify vm_page_cowfault() to release the page queues lock
if it sleeps.


105407 18-Oct-2002 dillon

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re


105229 16-Oct-2002 phk

Properly put macro args in ().

Spotted by: FlexeLint.
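
For readers unfamiliar with this class of warning, a short illustration.
The macros here are hypothetical and are not the ones changed by the
commit; they only show what goes wrong when a macro argument is used
without parentheses.

```c
#include <assert.h>

/* Hypothetical macros, not the ones touched by the commit. */
#define DOUBLE_BAD(x)  (x * 2)      /* argument not parenthesized */
#define DOUBLE_GOOD(x) ((x) * 2)    /* argument parenthesized */

/* DOUBLE_BAD(a + b) expands to (a + b * 2): precedence changes the result. */
int double_bad(int a, int b)
{
    return DOUBLE_BAD(a + b);
}

/* DOUBLE_GOOD(a + b) expands to ((a + b) * 2), as intended. */
int double_good(int a, int b)
{
    return DOUBLE_GOOD(a + b);
}
```

With these definitions, double_bad(1, 2) evaluates to 5 while
double_good(1, 2) evaluates to 6.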


105126 14-Oct-2002 julian

Remove old useless debugging code


104964 12-Oct-2002 jeff

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


104387 02-Oct-2002 jhb

Rename the mutex thread and process states to use a more generic 'LOCK'
name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of
TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that
will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will
be used internally by the turnstile code and will not be specific to
mutexes. Making the change now ensures that turnstiles can be dropped
in at a later date without affecting the ABI of userland applications.


104354 02-Oct-2002 scottl

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
whose size can be specified when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


103925 25-Sep-2002 jeff

- Get rid of the unused LK_NOOBJ.


103924 25-Sep-2002 jeff

- Lock access to numoutput on the swap devices.


103923 25-Sep-2002 jeff

- Add an ASSERT_VOP_LOCKED in vnode_pager_alloc.
- Lock access to v_iflags.


103794 22-Sep-2002 mdodd

Modify vm_map_clean() (and thus the msync(2) system call) to support
invalidation of cached pages for objects of type OBJT_DEVICE.

Submitted by: Christian Zander <zander@minion.de>
Approved by: alc


103777 22-Sep-2002 alc

o Update some comments.


103767 21-Sep-2002 jake

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


103732 21-Sep-2002 alc

Reduce namespace pollution.

Submitted by: bde


103623 19-Sep-2002 jeff

- Use my freebsd email alias in the copyright.
- Remove redundant instances of my email alias in the file summary.


103531 18-Sep-2002 jeff

- Split UMA_ZFLAG_OFFPAGE into UMA_ZFLAG_OFFPAGE and UMA_ZFLAG_HASH.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.


103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


103216 11-Sep-2002 julian

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


103123 09-Sep-2002 tanimura

- Do not swap out a process if it is in creation. The process may have no
address space yet.

- Check whether a process is a system process prior to dereferencing
its p_vmspace. Aio assumes that only the curthread switches the address
space of a system process.


103002 06-Sep-2002 julian

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures* .

i.e. it gets the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non-threaded program (non-KSE, that is) the allocated
thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


102966 05-Sep-2002 bde

Use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.


102950 05-Sep-2002 davidxu

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix the abbreviations for the P_STOPPED_* etc. flags; in the original code
they were inconsistent and difficult to distinguish.

Approved by: julian (mentor)


102835 02-Sep-2002 alc

o Synchronize updates to struct vm_page::cow with the page queues lock.


102738 31-Aug-2002 dillon

Reduce the maximum KVA reserved for swap meta structures from 70 to 32 MB.
Reduce the swap meta calculation by a factor of 2; it's still massive overkill.

X-MFC after: immediately


102600 30-Aug-2002 peter

Change hw.physmem and hw.usermem to unsigned long like they used to be
in the original hardwired sysctl implementation.

The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int. Use
'long' there.

Change Maxmem and physmem and related variables to 'long', mostly for
completeness. Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody. This
comes for free on 32 bit machines, so why not?


102399 25-Aug-2002 alc

o Retire pmap_pageable(). It's an advisory routine that none
of our platforms implements.


102382 25-Aug-2002 alc

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


102372 24-Aug-2002 alc

o Use vm_object_lock() in place of directly locking Giant.

Reviewed by: md5


102370 24-Aug-2002 alc

o Use vm_object_lock() in place of Giant when manipulating a vm object
in vm_map_insert().


102349 24-Aug-2002 alc

o Resurrect vm_object_lock() and vm_object_unlock() from revision 1.19.
(For now, they simply acquire and release Giant.)


102241 21-Aug-2002 archie

Don't use "NULL" when "0" is really meant.


101657 11-Aug-2002 alc

o Assert that the page queues lock is held in vm_page_activate().


101656 11-Aug-2002 alc

o Lock page queue accesses by vm_page_activate().


101655 10-Aug-2002 alc

o Lock page queue accesses by vm_page_activate().


101654 10-Aug-2002 alc

o Move a call to vm_page_wakeup() inside the scope of the page queues lock.


101645 10-Aug-2002 alc

o Remove the setting and clearing of the PG_MAPPED flag from the alpha and
ia64 pmap.
o Remove the PG_MAPPED flag's declaration.


101634 10-Aug-2002 alc

o Remove the setting and clearing of the PG_MAPPED flag. (This flag is
obsolete.)


101543 08-Aug-2002 alc

o Use pmap_page_is_mapped() in vm_page_protect() rather than the PG_MAPPED
flag. (This is the only place in the entire kernel where the PG_MAPPED
flag is tested. It will be removed soon.)


101327 04-Aug-2002 alc

o Acquire the page queues lock before checking the page's busy status
in vm_page_grab(). Also, replace the nearby tsleep() with an msleep()
on the page queues lock.


101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


101304 04-Aug-2002 alc

o Extend the scope of the page queues lock in contigmalloc1().
o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy()
in vm_contig_launder().


101250 03-Aug-2002 alc

o Remove the setting of PG_MAPPED from vm_page_wire() and
vm_page_alloc(VM_ALLOC_WIRED).


101236 02-Aug-2002 alc

o Convert two instances of vm_page_sleep_busy() into vm_page_sleep_if_busy()
with appropriate page queue locking.


101200 02-Aug-2002 alc

o Lock page queue accesses in nwfs and smbfs.
o Assert that the page queues lock is held in vm_page_deactivate().


101196 02-Aug-2002 alc

o Lock page queue accesses by vm_page_deactivate().


101174 01-Aug-2002 alc

o Acquire the page queues lock before calling vm_page_io_finish().
o Assert that the page queues lock is held in vm_page_io_finish().


101105 31-Jul-2002 alc

o Setting PG_MAPPED and PG_WRITEABLE on pages that are mapped and unmapped
by pmap_qenter() and pmap_qremove() is pointless. In fact, it probably
leads to unnecessary pmap_page_protect() calls if one of these pages is
paged out after unwiring.

Note: setting PG_MAPPED asserts that the page's pv list may be
non-empty. Since checking the status of the page's pv list isn't any
harder than checking this flag, the flag should probably be eliminated.
Alternatively, PG_MAPPED could be set by pmap_enter() exclusively
rather than various places throughout the kernel.


101019 31-Jul-2002 alc

o Lock page accesses by vm_page_io_start() with the page queues lock.
o Assert that the page queues lock is held in vm_page_io_start().


100915 30-Jul-2002 alc

o In vm_object_madvise() and vm_object_page_remove() replace
vm_page_sleep_busy() with vm_page_sleep_if_busy(). At the same time,
increase the scope of the page queues lock. (This should significantly
reduce the locking overhead in vm_object_page_remove().)
o Apply some style fixes.


100913 30-Jul-2002 tanimura

- Optimize wakeup() and its friends; if a thread being woken up is already
being swapped in, we do not have to ask the scheduler thread to do
that.

- Assert that a process is not swapped out in runq functions and
swapout().

- Introduce thread_safetoswapout() for readability.

- In swapout_procs(), perform a test that may block (check of a
thread working on its vm map) first. This lets us call swapout()
with the sched_lock held, providing a better atomicity.


100889 29-Jul-2002 alc

o Introduce vm_page_sleep_if_busy() as an eventual replacement for
vm_page_sleep_busy(). vm_page_sleep_if_busy() uses the page
queues lock.


100885 29-Jul-2002 julian

Remove an XXXKSE comment. The code is no longer a problem.


100884 29-Jul-2002 julian

Create a new thread state to describe threads that would be ready to run
except for the fact that they are presently swapped out. Also add a process
flag to indicate that the process has started the struggle to swap
back in. This will be needed for the case where multiple threads
start the swapin action, to stop a collision. Also add code to stop
a process from being swapped out if one of the threads in this
process is actually off running on another CPU.. that might hurt...

Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>


100862 29-Jul-2002 alc

o Pass VM_ALLOC_WIRED to vm_page_grab() rather than calling vm_page_wire()
in pmap_new_thread(), pmap_pinit(), and vm_proc_new().
o Lock page queue accesses by vm_page_free() in pmap_object_init_pt().


100836 28-Jul-2002 alc

o Modify vm_page_grab() to accept VM_ALLOC_WIRED.


100832 28-Jul-2002 alc

o Lock page queue accesses by vm_page_free().
o Apply some style fixes.


100829 28-Jul-2002 alc

o Lock page queue accesses by vm_page_free().


100797 28-Jul-2002 alc

o Lock page queue accesses by vm_page_free().
o Increment cnt.v_dfree inside vm_pageout_page_free() rather than
at each call.


100796 28-Jul-2002 alc

o Lock page queue accesses by vm_page_free().


100779 27-Jul-2002 alc

o Require that the page queues lock is held on entry to vm_pageout_clean()
and vm_pageout_flush().
o Acquire the page queues lock before calling vm_pageout_clean()
or vm_pageout_flush().


100742 27-Jul-2002 alc

o Lock page queue accesses by vm_page_activate().


100740 27-Jul-2002 alc

o Lock page queue accesses by vm_page_activate() and vm_page_deactivate()
in vm_pageout_object_deactivate_pages().
o Apply some style fixes to vm_pageout_object_deactivate_pages().


100736 27-Jul-2002 alc

o Lock page queue accesses by vm_page_activate() and vm_page_deactivate().


100686 25-Jul-2002 alc

o Remove a vm_page_deactivate() that is immediately followed by a
vm_page_rename() from vm_object_backing_scan(). vm_page_rename()
also performs vm_page_deactivate() on pages in the cache queues,
making the removed vm_page_deactivate() redundant.


100630 24-Jul-2002 alc

o Merge vm_fault_wire() and vm_fault_user_wire() by adding a new parameter,
user_wire.


100545 23-Jul-2002 alc

o Lock page queue accesses by vm_page_dontneed().
o Assert that the page queue lock is held in vm_page_dontneed().


100542 23-Jul-2002 alc

o Extend the scope of the page queues lock in vm_pageout_scan()
to cover the traversal of the cache queue.


100512 22-Jul-2002 alfred

Change struct vmspace->vm_shm from void * to struct shmmap_state *, this
removes the need for casts in several cases.


100511 22-Jul-2002 alfred

Remove caddr_t.


100456 21-Jul-2002 alc

o Lock page queue accesses by vm_page_free() and vm_page_deactivate().


100452 21-Jul-2002 alc

o Lock page queue accesses by vm_page_free().


100438 21-Jul-2002 tanimura

Do not pass a thread with the state TDS_RUNQ to setrunqueue(), otherwise
assertion in setrunqueue() fails.


100415 20-Jul-2002 alc

o Lock page queue accesses by vm_page_try_to_cache(). (The accesses
in kern/vfs_bio.c are already locked.)
o Assert that the page queues lock is held in vm_page_try_to_cache().


100414 20-Jul-2002 alc

o Assert that the page queues lock is held in vm_page_try_to_free().


100413 20-Jul-2002 alc

o Lock page queue accesses by vm_page_cache() in vm_fault() and
vm_pageout_scan(). (The others are already locked.)
o Assert that the page queues lock is held in vm_page_cache().


100407 20-Jul-2002 alc

o Lock accesses to the active page queue in vm_pageout_scan() and
vm_pageout_page_stats().


100397 20-Jul-2002 alc

o Lock page queue accesses by vm_page_cache() in vm_contig_launder().
o Micro-optimize the control flow in vm_contig_launder().


100396 20-Jul-2002 alc

o Remove dead and/or unused code.


100384 20-Jul-2002 peter

Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from: dfr (mostly, many tweaks from me).


100379 19-Jul-2002 peter

Set P_NOLOAD on the pagezero kthread so that it doesn't artificially skew
the loadav. This is not real load. If you have a nice process running in
the background, pagezero may sit in the run queue for ages and add one to
the loadav, thereby affecting other scheduling decisions.


100342 19-Jul-2002 alc

o Duplicate an odd side-effect of vm_page_wire() in vm_page_allocate()
when VM_ALLOC_WIRED is specified: set the PG_MAPPED bit in flags.
o In both vm_page_wire() and vm_page_allocate() add a comment saying
that setting PG_MAPPED does not belong there.


100331 18-Jul-2002 alc

o Remove the acquisition and release of Giant from the idle priority thread
that pre-zeroes free pages.
o Remove GIANT_REQUIRED from some low-level page queue functions. (Instead
assertions on the page queue lock are being added to the higher-level
functions, like vm_page_wire(), etc.)

In collaboration with: peter


100326 18-Jul-2002 markm

Void functions cannot return values.


100309 18-Jul-2002 peter

(VM_MAX_KERNEL_ADDRESS - KERNBASE) / PAGE_SIZE may not fit in an integer.
Use lmin(long, long), not min(u_int, u_int). This is a problem here on
ia64 which has *way* more than 2^32 pages of KVA. 281474976710655 pages
to be precise.
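The truncation described in r100309 can be sketched in user space as follows. This is only an illustration: min_u32() and min_i64() are hypothetical stand-ins for a min() taking u_int and an lmin() taking long on an LP64 platform, not the kernel's own definitions, and the fixed-width types are used so that the sketch behaves the same everywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Models min(u_int, u_int): arguments are narrowed to 32 bits on entry. */
static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

/* Models lmin(long, long) on an LP64 platform such as ia64. */
static int64_t
min_i64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/* Clamp through the 32-bit helper; both values are truncated first. */
static int64_t
clamp_narrow(int64_t wanted, int64_t kva_pages)
{
	assert(wanted >= 0 && kva_pages >= 0);
	return ((int64_t)min_u32((uint32_t)wanted, (uint32_t)kva_pages));
}

/* Clamp through the 64-bit helper; no bits are lost. */
static int64_t
clamp_wide(int64_t wanted, int64_t kva_pages)
{
	assert(wanted >= 0 && kva_pages >= 0);
	return (min_i64(wanted, kva_pages));
}
```

With the page count quoted above (281474976710655) as the limit, a request larger than 2^32 pages comes back from clamp_narrow() as an unrelated small number, while clamp_wide() returns it unchanged.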


100276 18-Jul-2002 alc

o Introduce an argument, VM_ALLOC_WIRED, that requests vm_page_alloc()
to return a wired page.
o Use VM_ALLOC_WIRED within Alpha's pmap_growkernel(). Also, because
Alpha's pmap_growkernel() calls vm_page_alloc() from within a critical
section, specify VM_ALLOC_INTERRUPT instead of VM_ALLOC_SYSTEM. (Only
VM_ALLOC_INTERRUPT is implemented entirely with a spin mutex.)
o Assert that the page queues mutex is held in vm_page_wire()
on Alpha, just like the other platforms.


100193 16-Jul-2002 alc

o Use vm_pageq_remove_nowakeup() and vm_pageq_enqueue() in
vm_page_zero_idle() instead of partially duplicated implementations.
In particular, this change guarantees that the number of free pages
in the free queue(s) matches the global free page count when Giant
is released.

Submitted by: peter (via his p4 "pmap" branch)


100031 15-Jul-2002 alc

o Create vm_contig_launder() to replace code that appears twice
in contigmalloc1().


100005 14-Jul-2002 alc

o Lock page queue accesses by vm_page_wire() that aren't
within a critical section.
o Assert that the page queues lock is held in vm_page_wire()
unless on an Alpha.


99985 14-Jul-2002 alc

o Lock page queue accesses by vm_page_wire().


99934 13-Jul-2002 alc

o Lock page queue accesses by vm_page_unmanage().
o Assert that the page queues lock is held in vm_page_unmanage().


99927 13-Jul-2002 alc

o Complete the locking of page queue accesses by vm_page_unwire().
o Assert that the page queues lock is held in vm_page_unwire().
o Make vm_page_lock_queues() and vm_page_unlock_queues() visible
to kernel loadable modules.


99920 13-Jul-2002 alc

o Lock some page queue accesses, in particular, those by vm_page_unwire().


99893 12-Jul-2002 alc

o Assert GIANT_REQUIRED on system maps in _vm_map_lock(),
_vm_map_lock_read(), and _vm_map_trylock(). Submitted by: tegge
o Remove GIANT_REQUIRED from kmem_alloc_wait() and kmem_free_wakeup().
(This clears the way for exec_map accesses to move outside of Giant.
The exec_map is not a system map.)
o Remove some premature MPSAFE comments.

Reviewed by: tegge


99890 12-Jul-2002 dillon

Re-enable the idle page-zeroing code. Remove all IPIs from the idle
page-zeroing code as well as from the general page-zeroing code and use a
lazy tlb page invalidation scheme based on a callback made at the end
of mi_switch.

A number of people came up with this idea at the same time so credit
belongs to Peter, John, and Jake as well.

Two-way SMP buildworld -j 5 tests (second run, after stabilization)
2282.76 real 2515.17 user 704.22 sys before peter's IPI commit
2266.69 real 2467.50 user 633.77 sys after peter's commit
2232.80 real 2468.99 user 615.89 sys after this commit

Reviewed by: peter, jhb
Approved by: peter


99851 12-Jul-2002 peter

Avoid a vm_page_lookup() - that uses a spinlock protected hash. We can
just use the object's memq for our nefarious purposes.


99850 12-Jul-2002 alc

o Lock some (unfortunately, not yet all) accesses to the page queues.


99849 12-Jul-2002 alc

o Lock accesses to the page queues.


99754 11-Jul-2002 alc

o Add a "needs wakeup" flag to the vm_map for use by kmem_alloc_wait()
and kmem_free_wakeup(). Previously, kmem_free_wakeup() always
called wakeup(). In general, no one was sleeping.
o Export vm_map_unlock_and_wait() and vm_map_wakeup() from vm_map.c
for use in vm_kern.c.
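The "needs wakeup" idea from r99754 can be modelled in a few lines. This is a single-threaded toy, assuming nothing about the real vm_map layout: struct toy_map, toy_map_prepare_sleep(), and toy_map_wakeup() are hypothetical names, and a counter stands in for the actual wakeup() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a map with a sleep channel; not the kernel's vm_map. */
struct toy_map {
	bool	needs_wakeup;	/* set by a thread about to sleep */
	int	wakeup_calls;	/* counts simulated wakeup() calls */
};

/* Called on the allocation path just before sleeping for free space. */
static void
toy_map_prepare_sleep(struct toy_map *map)
{
	assert(map != NULL);
	map->needs_wakeup = true;
}

/* Called on the free path: only issue a wakeup if someone asked for one. */
static void
toy_map_wakeup(struct toy_map *map)
{
	assert(map != NULL);
	if (map->needs_wakeup) {
		map->needs_wakeup = false;
		map->wakeup_calls++;	/* stands in for wakeup(map) */
	}
}
```

In the common case where no one is sleeping, toy_map_wakeup() does nothing, which is the saving the commit message describes.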


99683 09-Jul-2002 alc

o Lock accesses to the page queues in vm_object_terminate().
o Eliminate some unnecessary 64-bit arithmetic in vm_object_split().


99625 08-Jul-2002 peter

vm_page_queue_free_mtx is a spin mutex, not a normal sleep mutex.
I do not know why this didn't panic my box, but I have most certainly
been using it:
peter@overcee[3:14pm]~src/sys/i386/i386-110> sysctl -a | grep zero
vm.stats.misc.zero_page_count: 2235
vm.stats.misc.cnt_prezero: 638951
vm.idlezero_enable: 1
vm.idlezero_maxrun: 16

Submitted by: Tor.Egge@cvsup.no.freebsd.org
Approved by: Tor's patches are never wrong. :-)


99624 08-Jul-2002 peter

Turn the zeroidle process off for SMP systems; there is still a possible
TLB problem when bouncing from one cpu to another (the original cpu will
not have purged its TLB if it simply went idle).

Pointed out by: Tor.Egge@cvsup.no.freebsd.org
Approved by: Tor is never wrong. :-)


99571 08-Jul-2002 peter

Add a special page zero entry point intended to be called via the single
threaded VM pagezero kthread outside of Giant. For some platforms, this
is really easy since it can just use the direct mapped region. For others,
IPI sending is involved or there are other issues, so grab Giant when
needed.

We still have preemption issues to deal with, but Alan Cox has an
interesting suggestion on how to minimize the problem on x86.

Use Luigi's hack for preserving the (lack of) priority.

Turn the idle zeroing back on since it can now actually do something useful
outside of Giant in many cases.


99563 08-Jul-2002 peter

Avoid vm_page_lookup() [grabs a spinlock] and just process the upage
object memq instead.

Suggested by: alc


99559 07-Jul-2002 peter

Collect all the (now equivalent) pmap_new_proc/pmap_dispose_proc/
pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code
and use a single equivalent MI version. There are other cleanups
needed still.

While here, use the UMA zone hooks to keep a cache of preinitialized
proc structures handy, just like the thread system does. This eliminates
one dependency on 'struct proc' being persistent even after being freed.
There are some comments about things that can be factored out into
ctor/dtor functions if it is worth it. For now they are mostly just
doing statistics to get a feel of how it is working.


99545 07-Jul-2002 alc

o Lock accesses to the free queue(s) in vm_page_zero_idle().


99514 07-Jul-2002 alc

o Traverse the object's memq rather than repeatedly calling vm_page_lookup()
in vm_object_split().


99509 06-Jul-2002 jeff

- Hold a lock on the vnode acquired from the file table across the call to
vm_mmap() as well as the GETATTR etc.
- If the handle is a vnode in vm_mmap() assert that it is locked.
- Wiggle Giant around a little to account for the extra vnode operation.


99476 05-Jul-2002 gallatin

Remove bogus vm_page_wakeup() in vm_page_cowfault() that will cause panics
in the zero-copy send path if a process attempts to write to a page
which is still in flight.

reviewed by: ken


99472 05-Jul-2002 jeff

Fix a lock order reversal in uma_zdestroy. The uma_mtx needs to be held across
calls to zone_drain().

Noticed by: scottl


99427 05-Jul-2002 alc

o Lock accesses to the free page queues in contigmalloc1().


99424 05-Jul-2002 jeff

Remove unnecessary includes.


99416 04-Jul-2002 alc

o Resurrect vm_page_lock_queues(), vm_page_unlock_queues(), and the free
queue lock (revision 1.33 of vm/vm_page.c removed them).
o Make the free queue lock a spin lock because it's sometimes acquired
inside of a critical section.


99408 04-Jul-2002 julian

A small cleanup.


99407 04-Jul-2002 julian

Don't call the thread setup routines from here;
they are already called when uma calls thread_init().


99374 03-Jul-2002 alc

o Make the reservation of KVA space for kernel map entries a function
of the KVA space's size in addition to the amount of physical memory
and reduce it by a factor of two.

Under the old formula, our reservation amounted to one kernel map entry
per virtual page in the KVA space on a 4GB i386.


99320 03-Jul-2002 jeff

Actually use the fini callback.

Pointy hat to: me :-(
Noticed By: Julian


99211 01-Jul-2002 robert

- Use (OFF_TO_IDX(off) - pi) instead of (OFF_TO_IDX(off - IDX_TO_OFF(pi))).
- Reformat a comment.
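The two expressions in r99211 agree because IDX_TO_OFF(pi) is always a whole number of pages, so subtracting it before or after the shift gives the same page index. A small sketch, assuming 4 KB pages and simplified local versions of the macros (the kernel's own definitions carry additional casts):

```c
#include <assert.h>
#include <stdint.h>

#define	PAGE_SHIFT	12	/* assumed 4 KB pages */

/* Simplified: byte offset -> page index, and page index -> byte offset. */
#define	OFF_TO_IDX(off)	((uint64_t)(off) >> PAGE_SHIFT)
#define	IDX_TO_OFF(idx)	((uint64_t)(idx) << PAGE_SHIFT)

/* Original form: convert the index to bytes, subtract, convert back. */
static uint64_t
rel_index_old(uint64_t off, uint64_t pi)
{
	assert(off >= IDX_TO_OFF(pi));
	return (OFF_TO_IDX(off - IDX_TO_OFF(pi)));
}

/* Simplified form: subtract directly in page units. */
static uint64_t
rel_index_new(uint64_t off, uint64_t pi)
{
	assert(OFF_TO_IDX(off) >= pi);
	return (OFF_TO_IDX(off) - pi);
}
```

For example, with off = 0x5abc and pi = 2, both forms yield page index 3.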


99196 01-Jul-2002 alc

o Remove some long dead code: from revision 1.41 of vm/vm_pager.c
3+ years ago.
o Remove some unused prototypes.


99093 29-Jun-2002 iedowse

Change the type of `tscan' in vm_object_page_clean() to vm_pindex_t,
as it stores an absolute page index that may not fit in a vm_offset_t.


99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
To come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
Expect slight instability in signals.


98892 26-Jun-2002 iedowse

Avoid using the 64-bit vm_pindex_t in a few places where 64-bit
types are not required, as the overhead is unnecessary:

o In the i386 pmap_protect(), `sindex' and `eindex' represent page
indices within the 32-bit virtual address space.
o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary
variable to store the low few bits of a vm_pindex_t that gets used
as an array index.
o vm_uiomove() uses `osize' and `idx' for page offsets within a
map entry.
o In vm_object_split(), `idx' is a page offset within a map entry.


98891 26-Jun-2002 iedowse

Use an explicit cast to avoid relying on sign extension to do the
right thing in code such as `vm_pindex_t x = ~SWAP_META_MASK'.

Reviewed by: dillon
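The hazard behind r98891 is that `~` is evaluated at the width of its operand and only afterwards widened to 64 bits. The sketch below uses hypothetical mask constants (not the real SWAP_META_MASK definition) and assumes a 32-bit int: a signed mask happens to sign-extend to the intended value, an unsigned one silently loses the high bits, and an explicit cast gives the intended value without depending on the operand's signedness.

```c
#include <assert.h>
#include <stdint.h>

#define	META_MASK_SIGNED	0xf	/* int constant */
#define	META_MASK_UNSIGNED	0xfu	/* unsigned int constant */

/* ~ is applied at int width; the negative result is sign-extended. */
static uint64_t
invert_signed(void)
{
	assert(sizeof(int) == 4);	/* the sketch assumes 32-bit int */
	return (~META_MASK_SIGNED);
}

/* ~ is applied at unsigned int width; the result is zero-extended. */
static uint64_t
invert_unsigned(void)
{
	return (~META_MASK_UNSIGNED);
}

/* Widen first, then invert: correct for either kind of constant. */
static uint64_t
invert_cast(void)
{
	return (~(uint64_t)META_MASK_UNSIGNED);
}
```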


98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c: Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
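The vm_object.c entry above describes a common wrapper pattern: one function takes an explicit wait flag, and the older entry point becomes a thin call that always passes the "may wait" flag. A user-space sketch of that shape, with hypothetical names (toy_object, TOY_M_NOWAIT, TOY_M_WAITOK, toy_pressure) and malloc() standing in for the zone allocator:

```c
#include <assert.h>
#include <stdlib.h>

#define	TOY_M_NOWAIT	0x1	/* fail rather than wait */
#define	TOY_M_WAITOK	0x2	/* waiting for memory is acceptable */

struct toy_object {
	unsigned long	size;
};

/* Set to nonzero to simulate memory pressure in the backing allocator. */
static int toy_pressure;

/* Variant that lets the caller choose whether the allocation may wait. */
static struct toy_object *
toy_object_allocate_wait(unsigned long size, int flags)
{
	struct toy_object *obj;

	assert(flags == TOY_M_NOWAIT || flags == TOY_M_WAITOK);
	if (toy_pressure && (flags & TOY_M_NOWAIT))
		return (NULL);	/* caller must cope with failure */
	toy_pressure = 0;	/* stands in for sleeping until memory frees up */
	obj = malloc(sizeof(*obj));
	if (obj != NULL)
		obj->size = size;
	return (obj);
}

/* The original entry point, now a wrapper that always allows waiting. */
static struct toy_object *
toy_object_allocate(unsigned long size)
{
	return (toy_object_allocate_wait(size, TOY_M_WAITOK));
}
```

A caller holding a mutex would use the NOWAIT form and handle a NULL return; everyone else keeps calling the wrapper unchanged.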


98848 26-Jun-2002 dillon

Enforce RLIMIT_VMEM on growable mappings (aka the primary stack or any
MAP_STACK mapping).

Suggested by: alc


98833 26-Jun-2002 dillon

Part I of RLIMIT_VMEM implementation. Implement core functionality for
a new resource limit that covers a process's entire VM space, including
mmap()'d space.

(Part II will be additional code to check RLIMIT_VMEM during exec() but it
needs more fleshing out).

PR: kern/18209
Submitted by: Andrey Alekseyev <uitm@zenon.net>, Dmitry Kim <jason@nichego.net>
MFC after: 7 days


98824 25-Jun-2002 iedowse

Complete the initial set of VM changes required to support full
64-bit file sizes. This step simply addresses the remaining overflows,
and does not attempt to optimise performance. The details are:

o Use a 64-bit type for the vm_object `size' and the size argument
to vm_object_allocate().
o Use the correct type for index variables in dev_pager_getpages(),
vm_object_page_clean() and vm_object_page_remove().
o Avoid an overflow in the i386 pmap_object_init_pt().


98823 25-Jun-2002 jeff

Turn VM_ALLOC_ZERO into a flag.

Submitted by: tegge
Reviewed by: dillon


98822 25-Jun-2002 jeff

Reduce the amount of code that runs with the zone lock held in slab_zalloc().
This allows us to run the zone initialization functions without any locks held.


98818 25-Jun-2002 alc

o Eliminate vmspace::vm_minsaddr. It's initialized but never used.
o Replace stale comments in vmspace by "const until freed" annotations
on some fields.


98686 23-Jun-2002 alc

o Remove GIANT_REQUIRED from kmem_alloc_pageable(), kmem_alloc_nofault(),
and kmem_free(). (Annotate as MPSAFE.)
o Remove incorrect casts from kmem_alloc_pageable() and kmem_alloc_nofault().


98656 23-Jun-2002 alc

o Remove the unnecessary acquisition and release of Giant around fdrop()
in mmap(2).


98632 22-Jun-2002 alc

o Reduce the scope of Giant in vm_mmap() to just the code that manipulates
a vnode. (Thus, MAP_ANON and MAP_STACK never acquire Giant.)


98630 22-Jun-2002 alc

o Replace mtx_assert(&Giant, MA_OWNED) in dev_pager_alloc()
with the acquisition and release of Giant. (Annotate as MPSAFE.)
o Reorder the sanity checks in dev_pager_alloc() to reduce
the time that Giant is held.


98624 22-Jun-2002 alc

o In vm_map_insert(), replace GIANT_REQUIRED by the acquisition and
release of Giant around the direct manipulation of the vm_object and
the optional call to pmap_object_init_pt().
o In vm_map_findspace(), remove GIANT_REQUIRED. Instead, acquire and
release Giant around the occasional call to pmap_growkernel().
o In vm_map_find(), remove GIANT_REQUIRED.


98607 22-Jun-2002 alc

o Replace GIANT_REQUIRED in swap_pager_alloc() by the acquisition and
release of Giant. (Annotate as MPSAFE.)


98605 22-Jun-2002 alc

o Remove GIANT_REQUIRED from phys_pager_alloc(). If handle isn't NULL,
acquire and release Giant. If handle is NULL, Giant isn't needed.
o Annotate phys_pager_alloc() and phys_pager_dealloc() as MPSAFE.


98604 22-Jun-2002 alc

o Replace GIANT_REQUIRED in vnode_pager_alloc() by the acquisition and
release of Giant. (Annotate as MPSAFE.)
o Also, in vnode_pager_alloc(), remove an unnecessary re-initialization
of struct vm_object::flags and move a statement that is duplicated
in both branches of an if-else.


98600 22-Jun-2002 alc

o Remove GIANT_REQUIRED from vslock().
o Annotate kernacc(), useracc(), and vslock() as MPSAFE.

Motivated by: alfred


98541 21-Jun-2002 alc

o Remove GIANT_REQUIRED from vm_map_stack().


98538 21-Jun-2002 alc

o Remove GIANT_REQUIRED from vm_pager_allocate() and vm_pager_deallocate().


98498 20-Jun-2002 alc

o Remove an incorrect cast from obreak(). This cast would,
for example, break an sbrk(>=4GB) on 64-bit architectures
even if the resource limit allowed it.
o Correct an off-by-one error.
o Correct a spelling error in a comment.
o Reorder an && expression so that the commonly FALSE expression
comes first.

Submitted by: bde (bullets 1 and 2)


98460 20-Jun-2002 alc

o Acquire and release the vm_map lock instead of Giant in obreak().
Consequently, use vm_map_insert() and vm_map_delete(), which expect
the vm_map to be locked, instead of vm_map_find() and vm_map_remove(),
which do not.


98455 19-Jun-2002 jeff

- Move the computation of pflags out of the page allocation loop in
kmem_malloc()
- zero fill pages if PG_ZERO bit is not set after allocation in kmem_malloc()

Suggested by: alc, jake


98451 19-Jun-2002 jeff

- Remove bogus use of kmem_alloc that was inherited from the old zone
allocator.
- Properly set M_ZERO when talking to the back end page allocators for
non malloc zones. This forces us to zero fill pages when they are first
brought into a cache.
- Properly handle M_ZERO in uma_zalloc_internal. This fixes a problem where
per cpu buckets weren't always getting zeroed.


98450 19-Jun-2002 jeff

Teach kmem_malloc about M_ZERO.


98414 19-Jun-2002 alc

o Replace GIANT_REQUIRED in vm_object_coalesce() by the acquisition and
release of Giant.
o Reduce the scope of GIANT_REQUIRED in vm_map_insert().

These changes will enable us to remove the acquisition and release
of Giant from obreak().


98397 18-Jun-2002 alc

o Remove LK_CANRECURSE from the vm_map lock.


98362 17-Jun-2002 jeff

Honor the BUCKETCACHE flag on free as well.


98361 17-Jun-2002 jeff

- Introduce the new M_NOVM option which tells uma to only check the currently
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.

- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.

- Use the ZONE_VM flag on vm map entries and pv entries on x86.
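
The difference between "do not block" and "do not even ask" can be sketched in a few lines of userland C. This is only an illustration of the idea described above, not the UMA code: the names sk_zone, sk_zalloc(), sk_zfree(), SK_NOWAIT and SK_NOVM are hypothetical, and malloc() stands in for the page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flags; they only mirror the roles described in the log. */
#define SK_NOWAIT 0x1   /* may ask the backend, but must not sleep */
#define SK_NOVM   0x2   /* must not ask the backend at all */

#define SK_CACHE_MAX 4

struct sk_zone {
    void  *cache[SK_CACHE_MAX]; /* items already on hand */
    int    ncached;
    size_t item_size;
    int    backend_calls;       /* how often the backend was consulted */
};

/*
 * Return a cached item if there is one.  Otherwise go to the backend,
 * unless the caller has forbidden that with SK_NOVM.
 */
static void *
sk_zalloc(struct sk_zone *z, int flags)
{
    assert(z != NULL && z->item_size > 0);
    if (z->ncached > 0)
        return (z->cache[--z->ncached]);
    if (flags & SK_NOVM)
        return (NULL);              /* do not even ask */
    z->backend_calls++;
    return (malloc(z->item_size));  /* stand-in for a page allocation */
}

/* Put an item back into the cache, or release it if the cache is full. */
static void
sk_zfree(struct sk_zone *z, void *item)
{
    assert(z != NULL);
    if (z->ncached < SK_CACHE_MAX)
        z->cache[z->ncached++] = item;
    else
        free(item);
}
```

With an empty cache, an SK_NOVM request fails without touching the backend; once an item has been freed back into the cache, the same request succeeds.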


98343 17-Jun-2002 alc

o Acquire and release Giant in vm_map_wakeup() to prevent
a lost wakeup().

Reviewed by: tegge


98304 16-Jun-2002 alc

o Remove GIANT_REQUIRED from vm_fault_user_wire().
o Move pmap_pageable() outside of Giant in vm_fault_unwire().
(pmap_pageable() is a no-op on all supported architectures.)
o Remove the acquisition and release of Giant from mlock().


98263 15-Jun-2002 alc

o Remove GIANT_REQUIRED from useracc() and vsunlock(). Neither
vm_map_check_protection() nor vm_map_unwire() expects Giant
to be held.


98240 15-Jun-2002 alc

o Remove the acquisition and release of Giant from munlock().

Reviewed by: tegge


98226 14-Jun-2002 alc

o Use vm_map_wire() and vm_map_unwire() in place of vm_map_pageable() and
vm_map_user_pageable().
o Remove vm_map_pageable() and vm_map_user_pageable().
o Remove vm_map_clear_recursive() and vm_map_set_recursive(). (They were
only used by vm_map_pageable() and vm_map_user_pageable().)

Reviewed by: tegge


98142 12-Jun-2002 alc

o Acquire and release Giant in vm_map_unlock_and_wait().

Submitted by: tegge


98119 11-Jun-2002 alc

o Properly handle a failure by vm_fault_wire() or vm_fault_user_wire()
in vm_map_wire().
o Make two white-space changes in vm_map_wire().

Reviewed by: tegge


98109 11-Jun-2002 alc

o Teach vm_map_delete() to respect the "in-transition" flag
on a vm_map_entry by sleeping until the flag is cleared.

Submitted by: tegge


98083 10-Jun-2002 alc

o In vm_map_entry_create(), call uma_zalloc() with M_NOWAIT on system maps.
Submitted by: tegge
o Eliminate the "!mapentzone" check from vm_map_entry_create() and
vm_map_entry_dispose(). Reviewed by: tegge
o Fix white-space usage in vm_map_entry_create().


98075 10-Jun-2002 iedowse

Correct the logic for determining whether the per-CPU locks need
to be destroyed. This fixes a problem where destroying a UMA zone
would fail to destroy all zone mutexes.

Reviewed by: jeff


98071 09-Jun-2002 alc

o Add vm_map_wire() for wiring contiguous regions of either kernel
or user vm_maps. This implementation has two key benefits when compared
to vm_map_{user_,}pageable(): (1) it avoids a race condition through
the use of "in-transition" vm_map entries and (2) it eliminates lock
recursion on the vm_map.

Note: there is still an error case that requires clean up.

Reviewed by: tegge


98052 08-Jun-2002 alc

o Simplify vm_map_unwire() by merging the second and third passes
over the caller-specified region.


98036 08-Jun-2002 alc

o Remove an unnecessary call to vm_map_wakeup() from vm_map_unwire().
o Add a stub for vm_map_wire().

Note: the description of the previous commit had an error. The in-
transition flag actually blocks the deallocation of a vm_map_entry by
vm_map_delete() and vm_map_simplify_entry().


98022 07-Jun-2002 alc

o Add vm_map_unwire() for unwiring contiguous regions of either kernel
or user vm_maps. In accordance with the standards for munlock(2),
and in contrast to vm_map_user_pageable(), this implementation does not
allow holes in the specified region. This implementation uses the
"in transition" flag described below.
o Introduce a new flag, "in transition," to the vm_map_entry.
Eventually, vm_map_delete() and vm_map_simplify_entry() will respect
this flag by deallocating in-transition vm_map_entrys, allowing
the vm_map lock to be safely released in vm_map_unwire() and (the
forthcoming) vm_map_wire().
o Modify vm_map_simplify_entry() to respect the in-transition flag.

In collaboration with: tegge


97947 06-Jun-2002 alfred

fix typo in _SYS_SYSPROTO_H_ case: s/mlockall_args/munlockall_args

Submitted by: Mark Santcroos <marks@ripe.net>


97787 03-Jun-2002 jeff

Add a comment describing a resource leak that occurs during a failure case
in obj_alloc.


97753 02-Jun-2002 alc

o Migrate vm_map_split() from vm_map.c to vm_object.c, renaming it
to vm_object_split(). Its interface should still be changed
to resemble vm_object_shadow().


97747 02-Jun-2002 alc

o Style fixes to vm_map_split(), including the elimination of one variable
declaration that shadows another.

Note: This function should really be vm_object_split(), not vm_map_split().

Reviewed by: md5


97729 02-Jun-2002 alc

o Condition vm_object_pmap_copy_1()'s compilation on the kernel
option ENABLE_VFS_IOOPT. Unless this option is in effect,
vm_object_pmap_copy_1() is not used.


97727 01-Jun-2002 alc

o Remove GIANT_REQUIRED from vm_map_zfini(), vm_map_zinit(),
vm_map_create(), and vm_map_submap().
o Make further use of a local variable in vm_map_entry_splay()
that caches a reference to one of a vm_map_entry's children.
(This reduces code size somewhat.)
o Revert a part of revision 1.66, deinlining vmspace_pmap().
(This function is MPSAFE.)


97710 01-Jun-2002 alc

o Revert a part of revision 1.66: contrary to what that commit message says,
deinlining vm_map_entry_behavior() and vm_map_entry_set_behavior()
actually increases the kernel's size.
o Make vm_map_entry_set_behavior() static and add a comment describing
its purpose.
o Remove an unnecessary initialization statement from vm_map_entry_splay().


97654 31-May-2002 des

Export nswapdev through sysctl(8).

Sponsored by: DARPA, NAI Labs


97648 31-May-2002 alc

Further work on pushing Giant out of the vm_map layer and down
into the vm_object layer:
o Acquire and release Giant in vm_object_shadow() and
vm_object_page_remove().
o Remove the GIANT_REQUIRED assertion preceding vm_map_delete()'s call
to vm_object_page_remove().
o Remove the acquisition and release of Giant around vm_map_lookup()'s
call to vm_object_shadow().


97556 30-May-2002 alfred

Check for defined(__i386__) instead of just defined(i386) since the compiler
will be updated to only define(__i386__) for ANSI cleanliness.


97453 29-May-2002 peter

The kernel printf does not have %i


97359 27-May-2002 alc

o Remove unused #defines.


97294 26-May-2002 alc

o Acquire and release Giant around pmap operations in vm_fault_unwire()
and vm_map_delete(). Assert GIANT_REQUIRED in vm_map_delete()
only if operating on the kernel_object or the kmem_object.
o Remove GIANT_REQUIRED from vm_map_remove().
o Remove the acquisition and release of Giant from munmap().


97198 24-May-2002 alc

o Replace the vm_map's hint by the root of a splay tree. By design,
the last accessed datum is moved to the root of the splay tree.
Therefore, on lookups in which the hint resulted in O(1) access,
the splay tree still achieves O(1) access. In contrast, on lookups
in which the hint failed miserably, the splay tree achieves amortized
logarithmic complexity, resulting in dramatic improvements on vm_maps
with a large number of entries. For example, the execution time
for replaying an access log from www.cs.rice.edu against the thttpd
web server was reduced by 23.5% due to the large number of files
simultaneously mmap()ed by this server. (The machine in question has
enough memory to cache most of this workload.)

Nothing comes for free: At present, I see a 0.2% slowdown on "buildworld"
due to the overhead of maintaining the splay tree. I believe that
some or all of this can be eliminated through optimizations
to the code.

Developed in collaboration with: Juan E Navarro <jnavarro@cs.rice.edu>
Reviewed by: jeff
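
The property relied on above (the last accessed datum ends up at the root, so a repeated lookup is O(1)) can be seen in a small top-down splay sketch. It is an assumption-laden illustration, not the vm_map code: sk_entry, sk_splay() and sk_insert() are hypothetical names, and a plain int key stands in for a map entry's address range.

```c
#include <assert.h>
#include <stddef.h>

struct sk_entry {
    int              key;       /* stands in for an entry's start address */
    struct sk_entry *left, *right;
};

/*
 * Top-down splay: move the node holding 'key' (or a neighbor, if the key
 * is absent) to the root and return the new root.
 */
static struct sk_entry *
sk_splay(struct sk_entry *t, int key)
{
    struct sk_entry n, *l, *r, *y;

    if (t == NULL)
        return (NULL);
    n.left = n.right = NULL;
    l = r = &n;
    for (;;) {
        if (key < t->key) {
            if (t->left == NULL)
                break;
            if (key < t->left->key) {   /* rotate right */
                y = t->left;
                t->left = y->right;
                y->right = t;
                t = y;
                if (t->left == NULL)
                    break;
            }
            r->left = t;                /* link into the right tree */
            r = t;
            t = t->left;
        } else if (key > t->key) {
            if (t->right == NULL)
                break;
            if (key > t->right->key) {  /* rotate left */
                y = t->right;
                t->right = y->left;
                y->left = t;
                t = y;
                if (t->right == NULL)
                    break;
            }
            l->right = t;               /* link into the left tree */
            l = t;
            t = t->right;
        } else
            break;
    }
    l->right = t->left;                 /* reassemble around the new root */
    r->left = t->right;
    t->left = n.right;
    t->right = n.left;
    return (t);
}

/* Insert a node with a key not yet in the tree; it becomes the root. */
static struct sk_entry *
sk_insert(struct sk_entry *t, struct sk_entry *e)
{
    e->left = e->right = NULL;
    if (t == NULL)
        return (e);
    t = sk_splay(t, e->key);
    assert(t->key != e->key);           /* sketch assumes distinct keys */
    if (e->key < t->key) {
        e->left = t->left;
        e->right = t;
        t->left = NULL;
    } else {
        e->right = t->right;
        e->left = t;
        t->right = NULL;
    }
    return (e);
}
```

A second sk_splay() for the same key returns after a single comparison, which is the case where the old hint also did well. A lookup for a different key pays the restructuring cost but leaves the tree better shaped for later lookups.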


97088 22-May-2002 alc

o Make contigmalloc1() static.


97007 20-May-2002 jhb

In uma_zalloc_arg(), if we are performing a M_WAITOK allocation, ensure
that td_intr_nesting_level is 0 (like malloc() does). Since malloc() calls
uma we can probably remove the check in malloc() for this now. Also,
perform an extra witness check in that case to make sure we don't hold
any locks when performing a M_WAITOK allocation.


96875 18-May-2002 alc

o Eliminate the acquisition and release of Giant from minherit(2).
(vm_map_inherit() no longer requires Giant to be held.)


96839 18-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and
release Giant around vm_map_madvise()'s call to pmap_object_init_pt().
o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition
and release of Giant.
o Remove the acquisition and release of Giant from madvise().


96832 18-May-2002 alc

o Remove the acquisition and release of Giant from mprotect().


96755 16-May-2002 trhodes

More s/file system/filesystem/g


96572 14-May-2002 phk

Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by: DARPA & NAI Labs.


96496 13-May-2002 jeff

Don't call the uz free function while the zone lock is held. This can lead
to lock order reversals. uma_reclaim now builds a list of freeable slabs and
then unlocks the zones to do all of the frees.
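
The two-phase pattern described above (collect under the lock, free after dropping it) might look roughly like this. It is a single-threaded sketch with hypothetical names (sk_zone, sk_slab, sk_reclaim()); an int flag stands in for the zone mutex so that the free routine can assert that the lock is not held.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sk_slab {
    struct sk_slab *next;
    int             in_use;     /* allocated items in this slab */
};

struct sk_zone {
    int             locked;     /* stand-in for a real mutex */
    struct sk_slab *slabs;
    int             freed;      /* slabs released so far */
};

static void sk_lock(struct sk_zone *z)   { assert(!z->locked); z->locked = 1; }
static void sk_unlock(struct sk_zone *z) { assert(z->locked);  z->locked = 0; }

/* Release one slab; must never run with the zone lock held. */
static void
sk_slab_free(struct sk_zone *z, struct sk_slab *s)
{
    assert(!z->locked);
    free(s);
    z->freed++;
}

/* Phase 1 unlinks empty slabs under the lock; phase 2 frees them unlocked. */
static int
sk_reclaim(struct sk_zone *z)
{
    struct sk_slab *s, **prev, *doomed = NULL;
    int n = 0;

    sk_lock(z);
    prev = &z->slabs;
    while ((s = *prev) != NULL) {
        if (s->in_use == 0) {
            *prev = s->next;    /* unlink from the zone */
            s->next = doomed;   /* push onto the private list */
            doomed = s;
        } else
            prev = &s->next;
    }
    sk_unlock(z);

    while ((s = doomed) != NULL) {
        doomed = s->next;
        sk_slab_free(z, s);
        n++;
    }
    return (n);
}
```

Because the doomed list is private to the caller, no other lock needs to be taken while the zone lock is held, which is what removes the ordering problem.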


96493 13-May-2002 jeff

Remove the hash_free() lock order reversal. This could have happened for
several reasons before. Fixing it involved restructuring the generic hash
code to require calling code to handle locking, unlocking, and freeing hashes
on error conditions.


96469 12-May-2002 alc

o Remove GIANT_REQUIRED and an excessive number of blank lines
from vm_map_inherit(). (minherit() need not acquire Giant
anymore.)


96441 12-May-2002 alc

o Acquire and release Giant in vm_object_reference() and
vm_object_deallocate(), replacing the assertion GIANT_REQUIRED.
o Remove GIANT_REQUIRED from vm_map_protect() and vm_map_simplify_entry().
o Acquire and release Giant around vm_map_protect()'s call to pmap_protect().

Altogether, these changes eliminate the need for mprotect() to acquire
and release Giant.


96096 06-May-2002 alc

o Header files shouldn't depend on options: Provide prototypes
for uiomoveco(), uioread(), and vm_uiomove() regardless
of whether ENABLE_VFS_IOOPT is defined or not.

Submitted by: bde


96095 06-May-2002 alc

o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.


96091 06-May-2002 alc

o Some improvements to the page coloring of vm objects, particularly,
for shadow objects.

Submitted by: bde


96087 06-May-2002 alc

o Move vm_freeze_copyopts() from vm_map.{c,h} to vm_object.{c,h}. It's plainly
an operation on a vm_object and belongs in the latter place.


96080 05-May-2002 alc

o Condition the compilation of uiomoveco() and vm_uiomove()
on ENABLE_VFS_IOOPT.
o Add a comment to the effect that this code is experimental
support for zero-copy I/O.


96073 05-May-2002 phk

Expand the one-line function pbreassignbuf() in the only place it is or could
be used.


96056 05-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_lookup() and vm_map_lookup_done().
o Acquire and release Giant around vm_map_lookup()'s call
to vm_object_shadow().


96044 04-May-2002 jeff

Use pages instead of uz_maxpages, which has not been initialized yet, when
creating the vm_object. This was broken after the code was rearranged to
grab giant itself.

Spotted by: alc


96042 04-May-2002 alc

o Make _vm_object_allocate() and vm_object_allocate() callable
without holding Giant.
o Begin documenting the trivial cases of the locking protocol
on vm_object.


96007 04-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_lookup_entry() and
vm_map_check_protection().
o Call vm_map_check_protection() without Giant held in munmap().


95942 02-May-2002 alc

o Change the implementation of vm_map locking to use exclusive locks
exclusively. The interface still, however, distinguishes
between a shared lock and an exclusive lock.


95931 02-May-2002 jeff

Hide a pointer to the malloc_type bucket at the end of the freed memory. If
this memory is modified after it has been freed we can now report its
previous owner.


95930 02-May-2002 jeff

Move around the dbg code a bit so it's always under a lock. This stops a
weird potential race if we were preempted right as we were doing the dbg
checks.


95925 02-May-2002 arr

- Changed the size element of uma_zctor_args to be size_t instead of int.
- Changed uma_zcreate to accept the size argument as a size_t instead of
int.

Approved by: jeff


95923 02-May-2002 jeff

malloc/free(9) no longer require Giant. Use the malloc_mtx to protect the
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.


95901 02-May-2002 alc

o Remove dead and lockmgr()-specific debugging code.


95899 02-May-2002 jeff

Remove the temporary alignment check in free().

Implement the following checks on freed memory in the bucket path:
- Slab membership
- Alignment
- Duplicate free

This previously was only done if we skipped the buckets. This code will slow
down INVARIANTS a bit, but it is smp safe. The checks were moved out of the
normal path and into hooks supplied in uma_dbg.


95823 30-Apr-2002 alc

o Convert the vm_page buckets mutex to a spin lock. (This resolves
an issue on the Alpha platform found by jeff@.)
o Simplify vm_page_lookup().

Reviewed by: jhb


95771 30-Apr-2002 jeff

Add a new UMA debugging facility. This will overwrite freed memory with
0xdeadc0de and then check for it just before memory is handed off as part
of a new request. This will catch any post free/pre alloc modification of
memory, as well as introduce errors for anything that tries to dereference
it as a pointer.

This code takes the form of special init, fini, ctor and dtor routines that
are specifically used by malloc. It is in a separate file because additional
debugging aids will want to live here as well.
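
A minimal sketch of the fill-and-check idea follows. The function names sk_trash_dtor() and sk_trash_ctor() are hypothetical and are not claimed to match the routines in the new file; the sketch also assumes the item size is a multiple of four bytes and the item is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_JUNK 0xdeadc0deU

/* On free: overwrite the whole item with the junk pattern. */
static void
sk_trash_dtor(void *mem, size_t size)
{
    uint32_t *p = mem;

    assert(size % sizeof(*p) == 0);
    for (size_t i = 0; i < size / sizeof(*p); i++)
        p[i] = SK_JUNK;
}

/*
 * On allocation: return 0 if the pattern is intact, or -1 if something
 * wrote to the item after it was freed.
 */
static int
sk_trash_ctor(const void *mem, size_t size)
{
    const uint32_t *p = mem;

    assert(size % sizeof(*p) == 0);
    for (size_t i = 0; i < size / sizeof(*p); i++)
        if (p[i] != SK_JUNK)
            return (-1);
    return (0);
}
```

A kernel implementation would more likely panic or log than return an error code; the return value here just makes the check easy to observe.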


95766 30-Apr-2002 jeff

Move the implementation of M_ZERO into UMA so that it can be passed to
uma_zalloc and friends. Remove this functionality from the malloc wrapper.

Document this change in uma.h and adjust variable names in uma_core.


95764 30-Apr-2002 alc

o Revert vm_fault1() to its original name vm_fault(), eliminating the wrapper
that took its place for the purposes of acquiring and releasing Giant.


95758 29-Apr-2002 jeff

Add a new zone flag UMA_ZONE_MTXCLASS. This puts the zone in its own
mutex class. Currently this is only used for kmapentzone because kmapents
are potentially allocated when freeing memory. This is not dangerous
though because no other allocations will be done while holding the
kmapentzone lock.


95710 29-Apr-2002 peter

Tidy up some loose ends.
i386/ia64/alpha - catch up to sparc64/ppc:
- replace pmap_kernel() with refs to kernel_pmap
- change kernel_pmap pointer to (&kernel_pmap_store)
(this is a speedup since ld can set these at compile/link time)
all platforms (as suggested by jake):
- gc unused pmap_reference
- gc unused pmap_destroy
- gc unused struct pmap.pm_count
(we never used pm_count - we track address space sharing at the vmspace)


95701 29-Apr-2002 alc

Document three synchronization issues in vm_fault().


95686 28-Apr-2002 alc

Pass the caller's file name and line number to the vm_map locking functions.


95610 28-Apr-2002 alc

o Introduce and use vm_map_trylock() to replace several direct uses
of lockmgr().
o Add missing synchronization to vmspace_swap_count(): Obtain a read lock
on the vm_map before traversing it.


95598 28-Apr-2002 peter

We do not necessarily need to map/unmap pages to zero parts of them.
On systems where physical memory is also direct mapped (alpha, sparc,
ia64 etc) this is slightly harmful.


95589 27-Apr-2002 alc

o Begin documenting the (existing) locking protocol on the vm_map
in the same style as sys/proc.h.
o Undo the de-inlining of several trivial, MPSAFE methods on the vm_map.
(Contrary to the commit message for vm_map.h revision 1.66 and vm_map.c
revision 1.206, de-inlining these methods increased the kernel's size.)


95532 26-Apr-2002 alc

o Control access to the vm_page_buckets with a mutex.
o Fix some style(9) bugs.


95432 25-Apr-2002 arr

- Fix a round down bogon in uma_zone_set_max().

Submitted by: jeff@


95112 20-Apr-2002 alc

Reintroduce locking on accesses to vm_object_list.


95021 19-Apr-2002 alc

o Move the acquisition of Giant from vm_fault() to the point
after initialization in vm_fault1().
o Fix some style problems in vm_fault1().


94981 18-Apr-2002 alc

Add a comment documenting a race condition in vm_fault(): Specifically, a
modification is made to the vm_map while only a read lock is held.


94977 18-Apr-2002 alc

o Call vm_map_growstack() from vm_fault() if vm_map_lookup() has failed
due to conditions that suggest the possible need for stack growth.
This has two beneficial effects: (1) we can
now remove calls to vm_map_growstack() from the MD trap handlers and (2)
simple page faults are faster because we no longer unnecessarily perform
vm_map_growstack() on every page fault.
o Remove vm_map_growstack() from the i386's trap_pfault().
o Remove the acquisition and release of Giant from i386's trap_pfault().
(vm_fault() still acquires it.)


94921 17-Apr-2002 peter

Do not free the vmspace until p->p_vmspace is set to null. Otherwise
statclock can access it in the tail end of statclock_process() at an
unfortunate time. This bit me several times on an SMP alpha (UP2000)
and the problem went away with this change. I'm not sure why it doesn't
break x86 as well. Maybe it's because the clocks are much faster
on alpha (HZ=1024 by default).


94912 17-Apr-2002 alc

Remove an unused option, VM_FAULT_HOLD, to vm_fault().


94777 15-Apr-2002 peter

Pass vm_page_t instead of physical addresses to pmap_zero_page[_area]()
and pmap_copy_page(). This gets rid of a couple more physical addresses
in upper layers, with the eventual aim of supporting PAE and dealing with
the physical addressing mostly within pmap. (We will need either 64 bit
physical addresses or page indexes, possibly both depending on the
circumstances. Leaving this to pmap itself gives more flexibility.)

Reviewed by: jake
Tested on: i386, ia64 and (I believe) sparc64. (my alpha was hosed)


94653 14-Apr-2002 jeff

Fix a witness warning when expanding a hash table. We were allocating the new
hash while holding the lock on a zone. Fix this by doing the allocation
separately from the actual hash expansion.

The lock is dropped before the allocation and reacquired before the expansion.
The expansion code checks to see if we lost the race and frees the new hash
if we do. We really never will lose this race because the hash expansion is
single threaded via the timeout mechanism.


94651 14-Apr-2002 jeff

Protect the initial list traversal in sysctl_vm_zone() with the uma_mtx.


94631 14-Apr-2002 jeff

Fix the calculation that determines uz_maxpages. It was off for large zones.
Fortunately we have no large zones with maximums specified yet, so it wasn't
breaking anything.

Implement blocking when a zone exceeds the maximum and M_WAITOK is specified.
Previously this just failed like the old zone allocator did. The old zone
allocator didn't support WAITOK/NOWAIT though so we should do what we
advertise.

While I was in there I cleaned up some more zalloc logic to further simplify
that code path and reduce redundant code. This was needed to make the blocking
work properly anyway.


94329 10-Apr-2002 jeff

Remember to unlock the zone if the fill count is too high.

Pointed out by: pete, jake, jhb


94240 08-Apr-2002 jeff

Quiet witness warnings about acquiring several zone locks. In the case that
this happens it is OK.


94165 08-Apr-2002 jeff

Add a mechanism to disable buckets when the v_free_count drops below
v_free_min. This should help performance in memory starved situations.


94163 08-Apr-2002 jeff

Don't release the zone lock until after the dtor has been called. As far as I
can tell this could not have caused any problems yet because UMA is still
called with giant.

Pointy hat to: jeff
Noticed by: jake


94161 08-Apr-2002 jeff

Implement uma_zdestroy(). Its prototype changed slightly. I decided that I
didn't like the wait argument and that if you were removing a zone it had
better be empty.

Also, I broke out part of hash_expand and made a separate hash_free() for use
in uma_zdestroy.


94159 08-Apr-2002 jeff

Rework most of the bucket allocation and free code so that per cpu locks are
never held across blocking operations. Also, fix two other lock order
reversals that were exposed by jhb's witness change.

The free path previously had a bug that would cause it to skip the free bucket
list in some cases and go straight to allocating a new bucket. This has been
fixed as well.

These changes made the bucket handling code much cleaner and removed quite a
few lock operations. This should be marginally faster now.

It is now possible to call malloc w/o Giant and avoid any witness warnings.
This still isn't entirely safe though because malloc_type statistics are not
protected by any lock.


94157 07-Apr-2002 jeff

Spelling correction; s/seperate/separate/g

Submitted by: eric


94156 07-Apr-2002 jeff

There should be no remaining references to these two files in the tree. If
there are, it is an error. vm_zone has been superseded by uma.


94155 07-Apr-2002 jeff

This fixes a bug where isitem never got set to 1 if a certain chain of events
relating to extreme low memory situations occurred. This was only ever seen on
the port build cluster, so many thanks to kris for helping me debug this.

Tested by: kris


93847 05-Apr-2002 alc

o Eliminate the use of grow_stack() and useracc() from sendsig(), osendsig(),
and osf1_sendsig().
o Eliminate the prototype for the MD grow_stack() now that it has been removed
from all platforms.


93823 04-Apr-2002 dillon

Embed a struct vmmeter in the per-cpu structure and add a macro,
PCPU_LAZY_INC() which increments elements in it for cases where we
can afford the occasional inaccuracy. Use of per-cpu stats counters
avoids significant cache stalls in various critical paths that would
otherwise severely limit our cpu scalability.

Adjust all sysctl's accessing cnt.* elements to now use a procedure
which aggregates the requested field for all cpus and for the global
vmmeter.

The global vmmeter is retained, since some stats counters, like v_free_min,
cannot be made per-cpu. Also, this allows us to convert counters from
the global vmmeter to the per-cpu vmmeter in a piecemeal fashion, so
have at it!
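
The aggregation scheme described above can be sketched as follows. Everything here is illustrative: sk_vmmeter, SK_PCPU_LAZY_INC() and sk_vcnt() are hypothetical stand-ins, the macro takes an explicit CPU index (which the real PCPU_LAZY_INC() is not assumed to do), and the field choice is only an example.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NCPU 4

struct sk_vmmeter {
    unsigned v_vm_faults;   /* event counter, safe to keep per CPU */
    unsigned v_free_min;    /* threshold, only meaningful globally */
};

static struct sk_vmmeter sk_global;         /* the retained global copy */
static struct sk_vmmeter sk_pcpu[SK_NCPU];  /* one private copy per CPU */

/* Unlocked increment of this CPU's copy; may lose an occasional update. */
#define SK_PCPU_LAZY_INC(cpu, field) (sk_pcpu[(cpu)].field++)

/* Sum one field, identified by its offset, over the global and all CPUs. */
static unsigned
sk_vcnt(size_t off)
{
    unsigned total;

    assert(off + sizeof(unsigned) <= sizeof(struct sk_vmmeter));
    total = *(unsigned *)((char *)&sk_global + off);
    for (int i = 0; i < SK_NCPU; i++)
        total += *(unsigned *)((char *)&sk_pcpu[i] + off);
    return (total);
}
```

Summing the global copy together with the per-CPU copies is what allows counters to be converted one at a time: a field that is still updated globally contributes only through sk_global, and a converted one only through sk_pcpu.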


93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


93716 03-Apr-2002 jake

Fix a long standing 32bit-ism. Don't assume that the size of a chunk of
memory in phys_avail will fit in 'int', use vm_size_t. This fixes booting
on sparc64 machines with more than 2 gigs of ram.

Thanks to Jan Chrillesen for providing me with access to a 4 gig machine.
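
The arithmetic behind this fix is easy to check in isolation. The sketch below uses hypothetical 64-bit typedefs (sk_paddr_t, sk_size_t) in place of the kernel's types and simply shows that the size of a 4 GB chunk cannot be represented in an int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uint64_t sk_paddr_t;    /* stand-in for a physical address */
typedef uint64_t sk_size_t;     /* stand-in for a 64-bit vm_size_t */

/* Size of the chunk [start, end), computed without narrowing. */
static sk_size_t
sk_chunk_size(sk_paddr_t start, sk_paddr_t end)
{
    assert(end >= start);
    return (end - start);
}

/* Would this size survive being stored in a plain int? */
static int
sk_fits_in_int(sk_size_t size)
{
    return (size <= (sk_size_t)INT_MAX);
}
```

For a chunk spanning 0 to 4 GB, sk_chunk_size() yields 4294967296, which sk_fits_in_int() rejects on any platform with a 32-bit int.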


93697 02-Apr-2002 alfred

fix comment typo, s/neccisary/necessary/g


93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


93273 27-Mar-2002 jeff

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


93194 26-Mar-2002 alc

Remove an unused prototype.


93089 24-Mar-2002 jeff

Reset the cachefree statistics after draining the cache. This fixes a bug
where a sysctl within 20 seconds of a cache_drain could yield negative "USED"
counts.

Also, grab the uma_mtx while in the sysctl handler. This hadn't caused
problems yet because Giant is held all the time.

Reported by: kkenn


92758 20-Mar-2002 jeff

Add uma_zone_set_max() to add enforced limits to non vm obj backed zones.


92748 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.


92727 19-Mar-2002 alfred

Remove __P.


92692 19-Mar-2002 jeff

Quiet a warning introduced by UMA. This only occurs on machines where
vm_size_t != unsigned long.

Reviewed by: phk


92666 19-Mar-2002 peter

Fix a gcc-3.1+ warning.
warning: deprecated use of label at end of compound statement

ie: you cannot do this anymore:
switch(foo) {
....

default:
}
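
One common way to satisfy the compiler is to give the trailing label a statement, such as a break. The function below is a hypothetical example of that form, not the code changed by this commit.

```c
#include <assert.h>

/* Return 1 for the one recognized value and 0 for everything else. */
static int
sk_classify(int foo)
{
    int hits = 0;

    switch (foo) {
    case 1:
        hits = 1;
        break;
    default:
        break;          /* a statement must follow the label */
    }
    assert(hits == 0 || hits == 1);
    return (hits);
}
```

A lone semicolon after the label would also work; break states the intent more clearly.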


92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


92588 18-Mar-2002 green

Back out the modification of vm_map locks from lockmgr to sx locks. The
best path forward now is likely to change the lockmgr locks to simple
sleep mutexes, then see if any extra contention it generates is greater
than removed overhead of managing local locking state information,
cost of extra calls into lockmgr, etc.

Additionally, making the vm_map lock a mutex and respecting it properly
will put us much closer to not needing Giant magic in vm.


92511 17-Mar-2002 alc

Remove vm_object_count: It's unused, incorrectly maintained and duplicates
information maintained by the zone allocator.


92475 17-Mar-2002 alc

Undo part of revision 1.57: Now that (o)sendsig() doesn't call useracc(),
the motivation for saving and restoring the map->hint in useracc() is gone.
(The same tests that motivated this change in revision 1.57 now show that
there is no performance loss from removing it.) This was really a hack and
some day we would have had to add new synchronization here on map->hint
to maintain it.


92466 17-Mar-2002 alc

Acquire a read lock on the map inside of vm_map_check_protection() rather
than expecting the caller to do so. This (1) eliminates duplicated code in
kernacc() and useracc() and (2) fixes missing synchronization in munmap().


92461 17-Mar-2002 jake

Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/
pmap_qremove. pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus. If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb. This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings. This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem. This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.

Reviewed by: peter


92363 15-Mar-2002 mckusick

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


92256 14-Mar-2002 green

Document faultstate.lookup_still_valid more than none.

Requested by: alfred


92246 13-Mar-2002 green

Rename SI_SUB_MUTEX to SI_SUB_MTX_POOL to make the name at all accurate.
While doing this, move it earlier in the sysinit boot process so that the
VM system can use it.

After that, the system is now able to use sx locks instead of lockmgr
locks in the VM system. To accomplish this, some of the more
questionable uses of the locks (such as testing whether they are
owned or not, as well as allowing shared+exclusive recursion) are
removed, and simpler logic throughout is used so locks should also be
easier to understand.

This has been tested on my laptop for months, and has not shown any
problems on SMP systems, either, so appears quite safe. One more
user of lockmgr down, many more to go :)


92029 10-Mar-2002 eivind

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


91946 09-Mar-2002 tegge

Revert change in revision 1.53 and add a small comment to protect
the revived code.

vm pages newly allocated are marked busy (PG_BUSY), thus calling
vm_page_delete before the pages have been freed or unbusied will
cause a deadlock since vm_page_object_page_remove will wait for the
busy flag to be cleared. This can be triggered by calling malloc
with size > PAGE_SIZE and the M_NOWAIT flag on systems low on
physical free memory.

A kernel module that reproduces the problem, written by Logan Gabriel
<logan@mail.2cactus.com>, can be found in the freebsd-hackers mail
archive (12 Apr 2001). The problem was recently noticed again by
Archie Cobbs <archie@dellroad.org>.

Reviewed by: dillon


91777 07-Mar-2002 dillon

Fix a bug in the vm_map_clean() procedure. msync()ing an area of memory
that has just been mapped MAP_ANON|MAP_NOSYNC and has not yet been accessed
will panic the machine.

MFC after: 1 day


91724 06-Mar-2002 dillon

Add a sequential iteration optimization to vm_object_page_clean(). This
moderately improves msync's and VM object flushing for objects containing
randomly dirtied pages (fsync(), msync(), filesystem update daemon),
and improves cpu use for small-ranged sequential msync()s in the face of
very large mmap()ings from O(N) to O(1) as might be performed by a database.

A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two
committed optimizations are turned on by default). 0 will turn off both
optimizations.

This code has already been tested under stable and is one in a series of
memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart
conditions that will be coming down the pipe.

MFC after: 3 days


91700 05-Mar-2002 eivind

* Move bswlist declaration and initialization from kern/vfs_bio.c to
vm/vm_pager.c, which is the only place it is used.
* Make the QUEUE_* definitions and bufqueues local to vfs_bio.c.
* constify buf_wmesg.


91641 04-Mar-2002 alc

o Create vm_pageq_enqueue() to encapsulate code that is duplicated time
and again in vm_page.c and vm_pageq.c.
o Delete unused prototypes. (Mainly a result of the earlier renaming
of various functions from vm_page_*() to vm_pageq_*().)


91605 03-Mar-2002 alc

Call vm_pageq_remove_nowakeup() rather than duplicating it.


91569 02-Mar-2002 alc

Remove some long dead code.


91420 27-Feb-2002 jhb

Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.


91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


91403 27-Feb-2002 silby

Fix a horribly suboptimal algorithm in the vm_daemon.

In order to determine what to page out, the vm_daemon checks
reference bits on all pages belonging to all processes. Unfortunately,
the algorithm used reacted badly with shared pages; each shared page
would be checked once per process sharing it; this caused an O(N^2)
growth of tlb invalidations. The algorithm has been changed so that
each page will be checked only 16 times.

Prior to this change, a fork/sleepbomb of 1300 processes could cause
the vm_daemon to take over 60 seconds to complete, effectively
freezing the system for that time period. With this change
in place, the vm_daemon completes in less than a second. Any system
with hundreds of processes sharing pages should benefit from this change.

Note that the vm_daemon is only run when the system is under extreme
memory pressure. It is likely that many people with loaded systems saw
no symptoms of this problem until they reached the point where swapping
began.

Special thanks go to dillon, peter, and Chuck Cranor, who helped me
get up to speed with vm internals.

PR: 33542, 20393
Reviewed by: dillon
MFC after: 1 week


91367 27-Feb-2002 peter

Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.


91344 27-Feb-2002 peter

Jake further reduced IPI shootdowns on sparc64 in loops by using ranged
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.

Obtained from: jake (MI code, suggestions for MD part)


91263 26-Feb-2002 peter

Remove unused variable (td)


91063 22-Feb-2002 phk

GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.


90944 19-Feb-2002 tegge

Add a page queue, PQ_HOLD, that temporarily owns pages with nonzero hold
count that would otherwise be on one of the free queues. This eliminates a
panic when broken programs unmap memory that still has pending IO from raw
devices.

Reviewed by: dillon, alc


90937 19-Feb-2002 silby

Add one more comment to the OOM changes so that future readers of
the code may better understand the code.

Suggested by: dillon
MFC after: 1 week


90935 19-Feb-2002 silby

Changes to make the OOM killer much more effective:

- Allow the OOM killer to target processes currently locked in
memory. These very often are the ones doing the memory hogging.
- Drop the wakeup priority of processes currently sleeping while
waiting for their page fault to complete. In order for the OOM
killer to work well, the killed process and other system processes
waiting on memory must be allowed to wakeup first.

Reviewed by: dillon
MFC after: 1 week


90702 15-Feb-2002 bde

Garbage-collect options ACPI_NO_ENABLE_ON_BOOT, AML_DEBUG, BLEED,
DEVICE_SYSCTLS, KEY, LOUTB, NFS_MUIDHASHSIZ, NFS_UIDHASHSIZ, PCI_QUIET
and SIMPLELOCK_DEBUG.


90538 11-Feb-2002 julian

In a threaded world, different priorities become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there
but remove all the code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


90263 05-Feb-2002 alfred

Fix a race with free'ing vmspaces at process exit when vmspaces are
shared.

Also introduce vm_endcopy instead of using pointer tricks when
initializing new vmspaces.

The race occurred because of how the reference was utilized:
test vmspace reference,
possibly block,
decrement reference

When sharing a vmspace between multiple processes it was possible
for two processes exiting at the same time to test the reference
count, possibly block and neither one free because they wouldn't
see the other's update.

Submitted by: green


90033 31-Jan-2002 dillon

GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after: 0 days


89802 25-Jan-2002 dwmalone

Remove a parameter name from a prototype.


89464 17-Jan-2002 bde

Don't declare vm_swapout() in the NO_SWAPPING case when it is not defined.

Fixed some style bugs.


89319 14-Jan-2002 alfred

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


89306 13-Jan-2002 alfred

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


88900 05-Jan-2002 jhb

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex release and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


88318 20-Dec-2001 dillon

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


87834 14-Dec-2001 dillon

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flushing a dirty buffer due to B_CACHE getting cleared,
we were accidentally setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentative, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week


87157 01-Dec-2001 luigi

vm/vm_kern.c: rate limit (to once per second) diagnostic printf when
you run out of mbuf address space.

kern/subr_mbuf.c: print a warning message when mb_alloc fails, again
rate-limited to at most once per second. This covers other
cases of mbuf allocation failures. Probably it also overlaps the
one handled in vm/vm_kern.c, so maybe the latter should go away.

This warning will let us gradually remove the printf that are scattered
across most network drivers to report mbuf allocation failures.
Those are potentially dangerous, in that they are not rate-limited and
can easily cause systems to panic.

Unless there is disagreement (which does not seem to be the case
judging from the discussion on -net so far), and because this is
sort of a safety bugfix, I plan to commit a similar change to STABLE
during the weekend (it affects kern/uipc_mbuf.c there).

Discussed-with: jlemon, silby and -net


86475 17-Nov-2001 jlemon

When laying out objects in a ZONE_INTERRUPT zone, allow them to cross
a page boundary, since we've already allocated all our contiguous kva
space up front. This eliminates some memory wastage, and allows us to
actually reach the # of objects that were specified in the zinit() call.

Reviewed by: peter, dillon


86236 09-Nov-2001 dillon

Fix deadlock introduced in 1.73 (Jan 1998). The paging-in-progress count
on a vnode-backed object must be incremented *after* obtaining the vnode
lock. If it is bumped before obtaining the vnode lock we can deadlock
against vtruncbuf().

Submitted by: peter, ps
MFC after: 3 days


86092 05-Nov-2001 dillon

Adjust vnode_pager_input_smlfs() to not attempt to BMAP blocks beyond the
file EOF. This works around a bug in the ISOFS (CDRom) BMAP code which
returns bogus values for requests beyond the file EOF rather than returning
an error, resulting in either corrupt data being mmap()'d beyond the file EOF
or resulting in a seg-fault on the last page of a mmap()'d file (mmap()s of
CDRom files).

Reported by: peter / Yahoo
MFC after: 3 days


85762 31-Oct-2001 dillon

Don't let pmap_object_init_pt() exhaust all available free pages
(allocating pv entries w/ zalloci) when called in a loop due to
an madvise(). It is possible to completely exhaust the free page list and
cause a system panic when an expected allocation fails.


85541 26-Oct-2001 dillon

Move recently added procedure which was incorrectly placed within an
#ifdef DDB block.


85517 26-Oct-2001 dillon

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


85272 21-Oct-2001 dillon

Syntax cleanup and documentation, no operational changes.

MFC after: 1 day


85227 20-Oct-2001 iedowse

Move the code that computes the system load average from vm_meter.c
to kern_synch.c in preparation for adding some jitter to the
inter-sample time.

Note that the "vm.loadavg" sysctl still lives in vm_meter.c which
isn't the right place, but it is appropriate for the current (bad)
name of that sysctl.

Suggested by: jhb (some time ago)
Reviewed by: bde


85070 17-Oct-2001 dillon

contigmalloc1() could cause the vm_page_zero_count to become incorrect.
Properly track the count.

Submitted by: mark tinguely <tinguely@web.cs.ndsu.nodak.edu>


85016 15-Oct-2001 tegge

Don't use an uninitialized field reserved for callers in the bio structure
passed to swap_pager_strategy(). Instead, use a field reserved for drivers
and initialize it before usage.

Reviewed by: dillon


84933 14-Oct-2001 tegge

Don't remove all mappings of a swapped out process if the vm map contained
wired entries. vm_fault_unwire() depends on the mapping being intact.

Reviewed by: dillon


84932 14-Oct-2001 tegge

Fix locking violations during page wiring:

- vm map entries are not valid after the map has been unlocked.

- An exclusive lock on the map is needed before calling
vm_map_simplify_entry().

Fix cleanup after page wiring failure to unwire all pages that had been
successfully wired before the failure was detected.

Reviewed by: dillon


84869 13-Oct-2001 dillon

Makes contigalloc[1]() create the vm_map / underlying wired pages in the
kernel map and object in a manner that contigfree() is actually able to
free. Previously contigfree() freed up the KVA space but could not
unwire & free the underlying VM pages due to mismatched pageability between
the map entry and the VM pages.

Submitted by: Thomas Moestl <tmoestl@gmx.net>
Testing by: mark tinguely <tinguely@web.cs.ndsu.nodak.edu>
MFC after: 3 days


84854 12-Oct-2001 dillon

Finally fix the VM bug where a file whose EOF occurs in the middle of a page
would sometimes prevent a dirty page from being cleaned, even when synced,
resulting in the dirty page being re-flushed to disk every 30-60 seconds or
so, forever. The problem is that when the filesystem flushes a page to
its backing file it typically does not clear dirty bits representing areas
of the page that are beyond the file EOF. If the file is also mmap()'d and
a fault is taken, vm_fault (properly, is required to) set the vm_page_t->dirty
bits to VM_PAGE_BITS_ALL. This combination could leave us with an uncleanable,
unfreeable page.

The solution is to have the vnode_pager detect the edge case and manually
clear the dirty bits representing areas beyond the file EOF. The filesystem
does the rest and the page comes up clean after the write completes.

MFC after: 3 days


84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


84812 11-Oct-2001 jhb

Add missing includes of sys/ktr.h.


84783 10-Oct-2001 ps

Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by: peter
MFC after: 2 weeks


84488 04-Oct-2001 iedowse

Remove the SSLEEP case from the load average computation. This has
been a no-op for as long as our CVS history goes back. Processes in
state SSLEEP could only be counted if p_slptime == 0, but immediately
before loadav() is called, schedcpu() has just incremented p_slptime
on all SSLEEP processes.


83986 26-Sep-2001 rwatson

o Modify access control checks in mmap() to use securelevel_gt() instead
of direct variable access.

Obtained from: TrustedBSD Project


83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


83276 10-Sep-2001 peter

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


82756 01-Sep-2001 jhb

Process priority is locked by the sched_lock, not the proc lock.


82699 31-Aug-2001 dillon

make swapon() MPSAFE (will adjust syscalls.master later)


82697 31-Aug-2001 dillon

mark obreak() and ovadvise() as being MPSAFE


82612 31-Aug-2001 dillon

Cleanup


82314 25-Aug-2001 peter

Implement idle zeroing of pages. I've been tinkering with this
on and off since John Dyson left his work-in-progress.

It is off by default for now. sysctl vm.zeroidle_enable=1 to turn it on.

There are some hacks here to deal with the present lack of preemption - we
yield after doing a small number of pages since we won't preempt otherwise.

This is basically Matt's algorithm [with hysteresis] with an idle process
to call it in a similar way it used to be called from the idle loop.

I cleaned up the includes a fair bit here too.


82290 24-Aug-2001 dillon

Remove support for the badly broken MAP_INHERIT (from -current only).


82127 22-Aug-2001 dillon

Move most of the kernel submap initialization code, including the
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independent area. i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.

Reviewed by: freebsd-smp, jake


82126 22-Aug-2001 dillon

KASSERT if vm_page_t->wire_count overflows.


81933 20-Aug-2001 dillon

Limit the amount of KVM reserved for the buffer cache and for swap-meta
information. The default limits only affect machines with > 1GB of ram
and can be overridden with two new kernel conf variables VM_SWZONE_SIZE_MAX
and VM_BCACHE_SIZE_MAX, or with loader variables kern.maxswzone and
kern.maxbcache. This has the effect of leaving more KVM available for
sizing NMBCLUSTERS and 'maxusers' and should avoid tripups where a sysad
adds memory to a machine and then sees the kernel panic on boot due to
running out of KVM.

Also change the default swap-meta auto-sizing calculation to allocate half
of what it was previously allocating. The prior defaults were way too high.
Note that we cannot afford to run out of swap-meta structures so we still
stay somewhat conservative here.


81399 10-Aug-2001 jhb

- Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.

Reviewed by: jasone, peter


81397 10-Aug-2001 jhb

- Remove asleep(), await(), and M_ASLEEP.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.

Reviewed by: jasone, peter


81148 05-Aug-2001 tmm

Add a missing semicolon to unbreak the kernel build with INVARIANTS
(which was unfortunately turned off in the configuration I used for the
last test build).

Spotted by: jake
Pointy hat to: tmm


81140 04-Aug-2001 jhb

Whitespace fixes.


81136 04-Aug-2001 tmm

Add a zdestroy() function to the zone allocator. This is needed for the
unload case of modules that use their own zones.
It has been tested with the nfs module.


81029 02-Aug-2001 alfred

Fixups for the initial allocation by dillon:
1) allocate fewer buckets
2) when failing to allocate swap zone, keep reducing the zone by
a third rather than a half in order to reduce the chance of
allocating way too little.

I also moved around some code for readability.

Suggested by: dillon
Reviewed by: dillon


80705 31-Jul-2001 jake

Oops. Last commit to vm_object.c should have got these files too.

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


80704 31-Jul-2001 jake

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


80517 28-Jul-2001 iedowse

Permit direct swapping to NFS regular files using swapon(2). We
already allow this for NFS swap configured via BOOTP, so it is
known to work fine.

For many diskless configurations it is more flexible to have the
client set up swapping itself; it can recreate a sparse swap file
to save on server space for example, and it works with a non-NFS
root filesystem such as an in-kernel filesystem image.


80204 23-Jul-2001 assar

make vm_page_select_cache static

Requested by: bde


80089 21-Jul-2001 assar

(vm_page_select_cache): add prototype


79744 15-Jul-2001 benno

The i386-specific includes in this file were "fixed" by bracketing them with
#ifndef __alpha__. Fix this for the rest of the world by turning it into
#ifdef __i386__.

Reviewed by: obrien


79443 09-Jul-2001 des

Fix missing newline and terminator at the end of the vm.zone sysctl.


79273 05-Jul-2001 mjacob

Apply field bandages to the includes so compiles happen on alpha.


79265 05-Jul-2001 dillon

Move vm_page_zero_idle() from machine-dependent sections to a
machine-independent source file, vm/vm_zeroidle.c. It was exactly the
same for all platforms and updating them all was getting annoying.


79263 04-Jul-2001 dillon

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


79248 04-Jul-2001 dillon

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


79242 04-Jul-2001 dillon

whitespace / register cleanup


79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


79127 03-Jul-2001 jhb

Fix a XXX comment by moving the initialization of the number of pbuf's
for the vnode pager to a new vnode pager init method instead of making it
a hack in getpages().


78622 22-Jun-2001 jhb

- Protect all accesses to nsw_[rw]count{,_{,a}sync} with the pbuf mutex.
- Don't drop the vm mutex while grabbing the pbuf mutex to manipulate
said variables.


78592 22-Jun-2001 bmilekic

Introduce numerous SMP friendly changes to the mbuf allocator. Namely,
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:

o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution due to excessively large allocation macros.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.

Additional things changed with this addition:

- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not deprecated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All infos are fetched via sysctl.

TODO (in order of priority):

- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters at least for mbuf clusters into an unused portion
of the cluster itself, to save space and need to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.

Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/


78521 20-Jun-2001 jhb

Don't lock around swap_pager_swap_init() that is only called once during
the pagedaemon's startup code since it calls malloc which results in lock
order reversals.


78481 20-Jun-2001 jhb

Put the scheduler, vmdaemon, and pagedaemon kthreads back under Giant for
now. The proc locking isn't actually safe yet and won't be until the proc
locking is finished.


78099 11-Jun-2001 dillon

Cleanup the tabbing


77948 09-Jun-2001 dillon

Two fixes to the out-of-swap process termination code. First, start killing
processes a little earlier to avoid a deadlock. Second, when calculating
the 'largest process' do not just count RSS. Instead count the RSS + SWAP
used by the process. Without this the code tended to kill small
inconsequential processes like, oh, sshd, rather than one of the many
'eatmem 200MB' I run on a whim :-). This fix has been extensively tested on
-stable and somewhat tested on -current and will be MFCd in a few days.

Shamed into fixing this by: ps


77604 01-Jun-2001 tmm

Change the way information about swap devices is exported to be more
canonical: define a versioned struct xswdev, and add a sysctl node
handler that allows the user to get this structure for a certain device
index by specifying this index as last element of the MIB.
This new node handler, vm.swap_info, replaces the old vm.nswapdev
and vm.swapdevX.* (where X was the index) sysctls.


77582 01-Jun-2001 tmm

Clean up the code exporting interrupt statistics via sysctl a bit:
- move the sysctl code to kern_intr.c
- do not use INTRCNT_COUNT, but rather eintrcnt - intrcnt to determine
the length of the intrcnt array
- move the declarations of intrnames, eintrnames, intrcnt and eintrcnt
from machine-dependent include files to sys/interrupt.h
- remove the hw.nintr sysctl, it is not needed.
- fix various style bugs

Requested by: bde
Reviewed by: bde (some time ago)


77398 29-May-2001 jhb

Don't hold the VM lock across VOP's and other things that can sleep.


77139 24-May-2001 jhb

Stick VM syscalls back under Giant if the BLEED option is not defined.


77115 24-May-2001 dillon

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


77094 23-May-2001 jhb

- Assert Giant is held in the vnode pager methods.
- Lock the VM while walking down a vm_object's backing_object list in
vnode_pager_lock().


77093 23-May-2001 jhb

- Add in several asserts of vm_mtx.
- Assert Giant in vm_pageout_scan() for the vnode hacking that it does.
- Don't hold vm_mtx around vget() or vput().
- Lock Giant when calling vm_pageout_scan() from the pagedaemon. Also,
lock curproc while setting the P_BUFEXHAUST flag.
- For now we still hold Giant for all of the vm_daemon. When process
limits are locked we will only need Giant for swapout_procs().


77091 23-May-2001 jhb

- Assert that the vm lock is held for all of _vm_object_allocate().
- Restore the previous order of setting up a new vm_object. The previous
had a small bug where we zero'd out the flags after we set the
OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
places.
- Add in some #ifdef objlocks code to lock individual vm objects when
vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command. If DDB
blocked on the lock, that would be worse than having an inconsistent
allproc list.


77090 23-May-2001 jhb

- Add lots of vm_mtx assertions.
- Add a few KTR tracepoints to track the addition and removal of
vm_map_entry's and the creation and free'ing of vmspace's.
- Adjust a few portions of code so that we update the process' vmspace
pointer to its new vmspace before freeing the old vmspace.


77089 23-May-2001 jhb

- Lock the VM around the pmap_swapin_proc() call in faultin().
- Don't lock Giant in the scheduler() function except for when calling
faultin().
- In swapout_procs(), lock the VM before the process to avoid a lock order
violation.
- In swapout_procs(), release the allproc lock before calling swapout().
We restart the process scan after swapping out a process.
- In swapout_procs(), un #if 0 the code to bump the vmspace reference count
and lock the process' vm structures. This bug was introduced by me and
could result in the vmspace being free'd out from under a running
process.
- Fix an old bug where the vmspace reference count was not free'd if we
failed the swap_idle_threshold2 test.


77088 23-May-2001 jhb

- Fix the sw_alloc_interlock to actually lock itself when the lock is
acquired.
- Assert Giant is held in the strategy, getpages, and putpages methods and
the getchainbuf, flushchainbuf, and waitchainbuf functions.
- Always call flushchainbuf() w/o the VM lock.


77087 23-May-2001 jhb

Assert Giant is held for the device pager alloc and getpages methods since
we call the mmap method of the cdevsw of the device we are mmap'ing.


77083 23-May-2001 jhb

- Obtain Giant in mmap() syscall while messing with file descriptors and
vnodes.
- Fix an old bug that would leak a reference to a fd if the vnode being
mmap'd wasn't of type VREG or VCHR.
- Lock Giant in vm_mmap() around calls into the VM that can call into
pager routines that need Giant or into other VM routines that need
Giant.
- Replace code that used a goto to jump around the else branch of a test
to use an else branch instead.


77080 23-May-2001 jhb

Acquire Giant around vm_map_remove() inside of the obreak() syscall for
vm_object_terminate().


77077 23-May-2001 jhb

Take a more conservative approach and still lock Giant around VM faults
for now.


77062 23-May-2001 jhb

Set the phys_pager_alloc_lock to 1 when it is acquired so that it is
actually locked.


77036 23-May-2001 alfred

acquire Giant when playing with the buffercache and doing IO.
use msleep against the vm mutex while waiting for a page IO to complete.


77010 22-May-2001 alfred

acquire vm mutex in swp_pager_async_iodone. Don't call swp_pager_async_iodone
with the mutex held.


76981 22-May-2001 jhb

Remove duplicate include and sort includes.


76978 22-May-2001 jhb

Sort includes.


76974 22-May-2001 jhb

Unlock the VM lock at the end of munlock() instead of locking it again.


76973 22-May-2001 jhb

Sort includes from previous commit.


76949 22-May-2001 jhb

Sort includes.


76827 19-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


76778 18-May-2001 jhb

- Use a timeout for the tsleep in scheduler() instead of having vmmeter()
wakeup proc0 by hand to enforce the timeout.
- When swapping out a process, keep the process locked via the proc lock
from the first checks up until we clear PS_INMEM and set PS_SWAPPING in
swapout(). The swapout() function now must be called with the proc lock
held and releases it before returning.
- Comment out the code to attempt to lock a process' VM structures before
swapping out. It is broken in that it releases the lock after obtaining
it. If it does grab the lock, it needs to hand it off to swapout()
instead of releasing it. This can be revisited when the VM is locked
as this is a valid test to perform. It also causes a lock order reversal
for the time being, which is the immediate cause for temporarily
disabling it.


76773 17-May-2001 jhb

During the code to pick a process to kill when memory is exhausted, keep
the process in question locked as soon as we find it and determine it to
be eligible until we actually kill it. To avoid deadlock, we don't block
on the process lock but skip any process that is already locked during our
search.
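
The skip-if-locked scan described above can be sketched in userland C.
This is only an illustration under stated assumptions: struct fake_proc,
proc_trylock(), proc_unlock() and pick_victim() are hypothetical names,
the try-lock is a plain C11 atomic_flag rather than the kernel's proc
lock, and "largest eligible entry" is an assumed ranking rule.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a process entry; not the kernel's struct proc. */
struct fake_proc {
    atomic_flag locked;    /* set while some thread holds this entry */
    size_t      size;      /* memory footprint used to rank candidates */
    bool        eligible;  /* whether the entry may be chosen at all */
};

/* Non-blocking acquire: returns true only if the lock was free. */
static bool
proc_trylock(struct fake_proc *p)
{
    return !atomic_flag_test_and_set(&p->locked);
}

static void
proc_unlock(struct fake_proc *p)
{
    atomic_flag_clear(&p->locked);
}

/*
 * Return the largest eligible entry, still locked, or NULL if none.
 * Entries whose lock cannot be taken immediately are skipped rather
 * than waited on, so the scan never blocks on an entry's lock.
 */
static struct fake_proc *
pick_victim(struct fake_proc *procs, size_t n)
{
    struct fake_proc *best = NULL;

    assert(procs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct fake_proc *p = &procs[i];

        if (!proc_trylock(p))
            continue;                 /* held elsewhere: skip it */
        if (!p->eligible || (best != NULL && p->size <= best->size)) {
            proc_unlock(p);           /* not a better candidate */
            continue;
        }
        if (best != NULL)
            proc_unlock(best);        /* drop the previous candidate */
        best = p;                     /* keep the new candidate locked */
    }
    return best;
}
```

The caller receives the chosen entry with its lock still held, so the
entry cannot change state between selection and the kill, and must
call proc_unlock() when done.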


76641 15-May-2001 jhb

- Use PROC_LOCK_ASSERT instead of a direct mtx_assert.
- Don't hold Giant in the swapper daemon while we walk the list of
processes looking for a process to swap back in.
- Don't bother grabbing the sched_lock while checking a process' sleep
time in swapout_procs() to ensure that a process has been idle for at
least swap_idle_threshold2 before swapping it out. If we lose the race
we just let a process stay in memory until the next call of
swapout_procs().
- Remove some unneeded spl's, sched_lock does all the locking needed in
this case.


76322 06-May-2001 phk

Actually biofinish(struct bio *, struct devstat *, int error) is more general
than the bioerror().

Most of this patch is generated by scripts.


76244 03-May-2001 markm

Putting sys/lockmgr.h in here allows us to depollute userland includes
a bit.
OK'ed by: bde


76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


76084 27-Apr-2001 alfred

Address a number of problems with sysctl_vm_zone().

The zone allocator's locks should be leaf locks, meaning that they
should never be held when entering into another subsystem. However,
the sysctl grabs the zone global mutex and individual zone mutexes,
and while holding them it calls SYSCTL_OUT, which recurses into the
VM subsystem in order to wire user memory to do a safe copy. This
can block and cause lock order reversals.

To fix this:
lock zone global.
get a count of the number of zones.
unlock global.
allocate temporary storage.
format and SYSCTL_OUT the banner.
lock global.
traverse list.
make sure we haven't looped more than the initial count taken
to avoid overflowing the allocated buffer.
lock each node.
read values and format into buffer.
unlock individual node.
unlock global.
format and SYSCTL_OUT the rest of the data.
free storage.
return.
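
The sequence above can be modelled in a few lines of userland C. This
is a single-threaded sketch, not the kernel code: struct zone,
zones_report(), ZONE_LINE_MAX and the toy global_lock()/global_unlock()
pair (which only records whether the "lock" is held) are assumptions
made for illustration, and the per-zone locks and banner are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ZONE_LINE_MAX 64          /* assumed upper bound on one output line */

struct zone {                     /* hypothetical stand-in for a zone node */
    const char  *name;
    int          used;
    struct zone *next;
};

static struct zone *zone_list;    /* protected by the toy global lock */
static int global_held;           /* toy lock state: 1 while "held" */

static void global_lock(void)   { assert(!global_held); global_held = 1; }
static void global_unlock(void) { assert(global_held);  global_held = 0; }

/*
 * Format the zone list into a newly allocated string. The caller
 * writes it out (the SYSCTL_OUT step) with no lock held, then frees it.
 */
static char *
zones_report(void)
{
    size_t count = 0, done = 0, off = 0;
    struct zone *z;
    char *buf;

    global_lock();                          /* 1. count under the lock */
    for (z = zone_list; z != NULL; z = z->next)
        count++;
    global_unlock();

    /* 2. allocate with no lock held, since allocation may block */
    buf = malloc(count * ZONE_LINE_MAX + 1);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';

    global_lock();                          /* 3. format under the lock */
    for (z = zone_list; z != NULL && done < count; z = z->next, done++) {
        /* never visit more nodes than were counted: the buffer is fixed */
        int n = snprintf(buf + off, ZONE_LINE_MAX, "%s %d\n",
            z->name, z->used);
        if (n < 0)
            n = 0;
        if (n >= ZONE_LINE_MAX)
            n = ZONE_LINE_MAX - 1;          /* line was truncated */
        off += (size_t)n;
    }
    global_unlock();
    return buf;                             /* 4. output happens unlocked */
}
```

The `done < count` bound matters because the list may have grown while
the lock was dropped for the allocation.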

Other problems included not checking for errors when doing sysctl out
of the column header. Fixed.

Inconsistent termination of the copied string. Fixed.

Objected to by: des (for not using sbuf)

Since the output is not variable length and I'm actually over
allocating significantly and I'd like to get this fixed now, I'll
work on the sbuf conversion at a later date. I would not object
to someone else taking it upon themselves to convert it to sbuf.
I hold no MAINTAINER rights to this code (for now).


75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


75692 19-Apr-2001 alfred

vnode_pager_freepage() is really vm_page_free() in disguise,
nuke vnode_pager_freepage() and replace all calls to it with vm_page_free()


75675 18-Apr-2001 alfred

Protect pager object creation with sx locks.

Protect pager object list manipulation with a mutex.

It doesn't look possible to combine them under a single sx lock because
creation may block and we can't have the object list manipulation block
on anything other than a mutex because of interrupt requests.


75644 18-Apr-2001 alfred

Fix the botched rev 1.59 where I made it such that without INVARIANTS
the map is never locked.

Submitted by: tegge


75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


75523 15-Apr-2001 alfred

use TAILQ_FOREACH, fix a comment's location


75477 13-Apr-2001 alfred

if/panic -> KASSERT


75474 13-Apr-2001 alfred

protect pbufs and associated counts with a mutex


75473 13-Apr-2001 alfred

use %p for pointer printf, include sys/systm.h for printf proto


75462 13-Apr-2001 alfred

Use a macro wrapper over printf along with KASSERT to reduce the amount
of code here.


75452 12-Apr-2001 alfred

remove truncated part from comment


74927 28-Mar-2001 jhb

Convert the allproc and proctree locks from lockmgr locks to sx locks.


74914 28-Mar-2001 jhb

Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>


74670 23-Mar-2001 tmm

Export intrnames and intrcnt as sysctls (hw.nintr, hw.intrnames and
hw.intrcnt).

Approved by: rwatson


74237 14-Mar-2001 dillon

Fix a lock reversal problem in the VM subsystem related to threaded
programs. There is a case during a fork() which can cause a deadlock.

From Tor -
The workaround consists of setting a flag in the vm map that
indicates that a fork is in progress and using that mark in the page
fault handling to force a revalidation failure. That change will only
affect (pessimize) page fault handling during fork for threaded
(linuxthreads style) applications and applications using aio_*().

Submitted by: tegge


74235 14-Mar-2001 dillon

Temporarily remove the vm_map_simplify() call from vm_map_insert(). The
call is correct, but it interferes with the massive hack called
vm_map_growstack(). The call will be returned after our stack handling
code is fixed.

Reported by: tegge


74042 09-Mar-2001 iedowse

When creating a shadow vm_object in vmspace_fork(), only one
reference count was transferred to the new object, but both the
new and the old map entries had pointers to the new object.
Correct this by transferring the second reference.

This fixes a panic that can occur when mmap(2) is used with the
MAP_INHERIT flag.

PR: i386/25603
Reviewed by: dillon, alc


73936 07-Mar-2001 jhb

Unrevert the pmap_map() changes. They weren't broken on x86.

Sense beaten into me by: peter


73903 07-Mar-2001 jhb

Back out the pmap_map() change for now, it isn't completely stable on the
i386.


73862 06-Mar-2001 jhb

- Rework pmap_map() to take advantage of direct-mapped segments on
supported architectures such as the alpha. This allows us to save
on kernel virtual address space, TLB entries, and (on the ia64) VHPT
entries. pmap_map() now modifies the passed in virtual address on
architectures that do not support direct-mapped segments to point to
the next available virtual address. It also returns the actual
address that the request was mapped to.
- On the IA64 don't use a special zone of PV entries needed for early
calls to pmap_kenter() during pmap_init(). This gets us in trouble
because we end up trying to use the zone allocator before it is
initialized. Instead, with the pmap_map() change, the number of needed
PV entries is small enough that we can get by with a static pool that is
used until pmap_init() is complete.

Submitted by: dfr
Debugging help: peter
Tested by: me


73534 04-Mar-2001 alfred

Simplify vm_object_deallocate(), by decrementing the refcount first.
This allows some of the conditionals to be combined.


73282 01-Mar-2001 gallatin

Allocate vm_page_array and vm_page_buckets from the end of the biggest chunk
of memory, rather than from the start.

This fixes problems allocating bouncebuffers on alphas where there is only
1 chunk of memory (unlike PCs where there is generally at least one small
chunk and a large chunk). Having 1 chunk had been fatal, because these
structures take over 13MB on a machine with 1GB of ram. This doesn't leave
much room for other structures and bounce buffers if they're at the front.

Reviewed by: dfr, anderson@cs.duke.edu, silence on -arch
Tested by: Yoriaki FUJIMORI <fujimori@grafin.fujimori.cache.waseda.ac.jp>


73212 28-Feb-2001 dillon

If we intend to make the page writable without requiring another fault,
make sure that PG_NOSYNC is properly set. Previously we only set it
for a write-fault, but this can occur on a read-fault too.
(will be MFCd prior to 4.3 freeze)


72949 23-Feb-2001 rwatson

Introduce per-swap area accounting in the VM system, and export
this information via the vm.nswapdev sysctl (number of swap areas)
and vm.swapdevX nodes (where X is the device), which contain the MIBs
dev, blocks, used, and flags. These changes are required to allow
top and other userland swap-monitoring utilities to run without
setgid kmem.

Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: freebsd-audit


72888 22-Feb-2001 des

Fix formatting bugs introduced in sysctl_vm_zone() by the previous commit.
Also, if SYSCTL_OUT() returns a non-zero value, stop at once.


72376 12-Feb-2001 jake

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propagated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propagate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propagate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at appropriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


71983 04-Feb-2001 dillon

This commit represents work mainly submitted by Tor and slightly modified
by myself. It solves a serious vm_map corruption problem that can occur
with the buffer cache when block sizes > 64K are used. This code has been
heavily tested in -stable but only tested somewhat on -current. An MFC
will occur in a few days. My additions include the vm_map_simplify_entry()
and minor buffer cache boundary case fix.

Make the buffer cache use a system map for buffer cache KVM rather than a
normal map.

Ensure that VM objects are not allocated for system maps. There were cases
where a buffer map could wind up with a backing VM object -- normally
harmless, but this could also result in the buffer cache blocking in places
where it assumes no blocking will occur, possibly resulting in corrupted
maps.

Fix a minor boundary case when the buffer cache size limit is reached that
could result in non-optimal code.

Add vm_map_simplify_entry() calls to prevent 'creeping proliferation'
of vm_map_entry's in the buffer cache's vm_map. Previously only a simple
linear optimization was made. (The buffer vm_map typically has only a
handful of vm_map_entry's. This stabilizes it at that level permanently).

PR: 20609
Submitted by: (Tor Egge) tegge


71610 25-Jan-2001 jhb

- Doh, lock faultin() with proc lock in scheduler().
- Lock p_swtime with sched_lock in scheduler() as well.


71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


71574 24-Jan-2001 jhb

Argh, I didn't get this test right when I converted it. Break this up
into two separate if's instead of nested if's. Also, reorder things
slightly to avoid unnecessary mutex operations.


71572 24-Jan-2001 jhb

- Catch up to proc flag changes.
- Minimal proc locking.
- Use queue macros.


71571 24-Jan-2001 jhb

Add mtx_assert()'s to verify that kmem_alloc() and kmem_free() are called
with Giant held.


71570 24-Jan-2001 jhb

- Catch up to proc flag changes.
- Proc locking in a few places.
- faultin() now must be called with the proc lock held.
- Split up swappable() into a couple of tests so that it can be locked in
swapout_procs().
- Use queue macros.


71569 24-Jan-2001 jhb

- Catch up to proc flag changes.


71512 24-Jan-2001 jhb

Add missing include.


71429 23-Jan-2001 ume

Add mibs to hold the number of forks since boot. New mibs are:

vm.stats.vm.v_forks
vm.stats.vm.v_vforks
vm.stats.vm.v_rforks
vm.stats.vm.v_kthreads
vm.stats.vm.v_forkpages
vm.stats.vm.v_vforkpages
vm.stats.vm.v_rforkpages
vm.stats.vm.v_kthreadpages

Submitted by: Paul Herman <pherman@frenchfries.net>
Reviewed by: alfred


71408 23-Jan-2001 jake

Sigh. atomic_add_int takes a pointer, not an integer.

Pointy-hat-to: des


71406 23-Jan-2001 des

Use atomic operations to update the stat counters.


71362 22-Jan-2001 des

Call vm_zone_init() at the appropriate time.

Reviewed by: jasone, jhb


71361 22-Jan-2001 des

Give this code a major facelift:

- replace the simplelock in struct vm_zone with a mutex.

- use a proper SLIST rather than a hand-rolled job for the zone list.

- add a subsystem lock that protects the zone list and the statistics
counters.

- merge _zalloc() into zalloc() and _zfree() into zfree(), and
move them below _zget() so there's no need for a prototype.

- add two initialization functions: one which initializes the
subsystem mutex and the zone list, and one that currently doesn't
do anything.

- zap zerror(); use KASSERTs instead.

- dike out half of sysctl_vm_zone(), which was mostly trying to do
manually what the snprintf() call could do better.

Reviewed by: jhb, jasone


71350 21-Jan-2001 des

First step towards an MP-safe zone allocator:
- have zalloc() and zfree() always lock the vm_zone.
- remove zalloci() and zfreei(), which are now redundant.

Reviewed by: bmilekic, jasone


70480 29-Dec-2000 alfred

fix comment which was outdated 3 years ago
remove useless assignment
purge entire file of 'register' keyword


70478 29-Dec-2000 alfred

clean up kmem_suballoc():
remove useless assignment
remove 'register' variables


70390 27-Dec-2000 assar

Make zalloc and zfree non-inline functions. This avoids having to
have the code calling these be compiled with the same setting for
INVARIANTS and SMP.

Reviewed by: dillon


70374 26-Dec-2000 dillon

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then successive passes will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, among other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


70160 18-Dec-2000 phk

Fix floppy drives on machines with lots of RAM.

The fix works by reverting the ordering of free memory so that the
chances of contig_malloc() succeeding increases.

PR: 23291
Submitted by: Andrew Atrens <atrens@nortel.ca>


69972 13-Dec-2000 tanimura

- If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.
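
As a rough illustration of the halving rule, the sketch below shrinks a
requested entry count until it fits in the space available. The name
fit_entries() and the explicit byte-size check are assumptions; the
commit does not say how "fits" is tested in the kernel. A result of
zero would correspond to the failure case in which swapon(2) is
rejected.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Halve the requested entry count until wanted * entry_size fits in
 * avail bytes. Returns 0 if not even one entry fits.
 */
static size_t
fit_entries(size_t wanted, size_t entry_size, size_t avail)
{
    size_t n = wanted;

    assert(entry_size > 0);
    /* Compare against avail / entry_size so the product cannot overflow. */
    while (n > 0 && n > avail / entry_size)
        n /= 2;
    return n;
}
```

For example, asking for 1000 entries of 16 bytes with 4096 bytes
available settles on 250 entries after two halvings.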

This is just a temporary fix. Better solutions include:
(suggested by: dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by: alfred, dillon


69947 13-Dec-2000 jake

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
passed to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


69847 11-Dec-2000 dillon

Be less conservative with a recently added KASSERT. Certain edge
cases with file fragments and read-write mmap's can lead to a situation
where a VM page has odd dirty bits, e.g. 0xFC - due to being dirtied by
an mmap and only the fragment (representing a non-page-aligned end of
file) synced via a filesystem buffer. A correct solution that
guarantees consistent m->dirty for the file EOF case is being
worked on. In the mean time we can't be so conservative in the
KASSERT.


69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


69687 06-Dec-2000 alfred

Really fix phys_pager:

Backout the previous delta (rev 1.4), it didn't make any difference.

If the requested handle is NULL then don't add it to the list of
objects, to be found by handle.

The problem is that when asking for a NULL handle you are implying
you want a new object. Because objects with NULL handles were
being added to the list, any further requests for phys backed
objects with NULL handles would return a reference to the initial
NULL handle object after finding it on the list.

Basically one couldn't have more than one phys backed object without
a handle in the entire system without this fix. If you did more
than one shared memory allocation using the phys pager it would
give you your initial allocation again.
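
The behaviour described above can be sketched with a small lookup
list. The names struct pobj, by_handle and pobj_alloc() are
hypothetical, and the real pager's locking is omitted; the sketch only
shows the rule that a NULL handle always yields a new object and is
never entered into the by-handle list.

```c
#include <assert.h>
#include <stdlib.h>

struct pobj {                     /* hypothetical stand-in for a pager object */
    void        *handle;
    int          refs;
    struct pobj *next;
};

static struct pobj *by_handle;    /* only objects with a non-NULL handle */

static struct pobj *
pobj_alloc(void *handle)
{
    struct pobj *o;

    if (handle != NULL) {
        /* A named request may share an existing object. */
        for (o = by_handle; o != NULL; o = o->next) {
            if (o->handle == handle) {
                assert(o->refs > 0);
                o->refs++;
                return o;
            }
        }
    }
    /* NULL handle, or no match: always create a new object. */
    o = calloc(1, sizeof(*o));
    if (o == NULL)
        return NULL;
    o->handle = handle;
    o->refs = 1;
    if (handle != NULL) {         /* anonymous objects stay off the list */
        o->next = by_handle;
        by_handle = o;
    }
    return o;
}
```

If anonymous objects were put on the list, the lookup would match the
first of them for every later NULL request, which is the bug the commit
describes.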


69641 05-Dec-2000 alfred

need to adjust allocation size to properly deal with non PAGE_SIZE
allocations, specifically with allocations < PAGE_SIZE when the code
doesn't work properly


69517 02-Dec-2000 bde

Backed out previous commit. Don't depend on namespace pollution in
<sys/buf.h>.


69516 02-Dec-2000 jhb

Protect p_stat with sched_lock.


69509 02-Dec-2000 jhb

Protect p_stat with sched_lock.


69399 30-Nov-2000 alfred

remove unneeded sys/ucred.h includes


69022 22-Nov-2000 jake

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


68921 20-Nov-2000 rwatson

o Export dmmax ("Maximum size of a swap block") using SYSCTL_INT.
This removes a reason that systat requires setgid kmem. More to
come.


68885 18-Nov-2000 dillon

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
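
The pitfall can be shown in isolation. In this sketch the constants
EX_PAGE_SIZE and EX_PAGE_MASK are illustrative and deliberately typed
unsigned (an assumption, chosen to expose the failure that signed
arithmetic happens to hide), and unsigned int is assumed to be 32 bits
wide.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; unsigned on purpose to expose the pitfall. */
#define EX_PAGE_SIZE 4096u
#define EX_PAGE_MASK (EX_PAGE_SIZE - 1u)

_Static_assert(sizeof(unsigned int) == 4,
    "sketch assumes a 32-bit unsigned int");

/* Widen first, then invert: every upper bit of the mask is a one. */
static uint64_t
trunc_page64(uint64_t off)
{
    return off & ~(uint64_t)EX_PAGE_MASK;
}

/* Invert at 32 bits, then widen: zero-extension clears bits 32..63. */
static uint64_t
trunc_page64_narrow(uint64_t off)
{
    return off & ~EX_PAGE_MASK;
}
```

With a signed int mask the inverted value is negative and sign-extends
to all ones in the upper half, which is why the unfixed code still
worked; the explicit widening removes the dependence on that.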

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


68884 18-Nov-2000 dillon

Add the splvm()'s suggested in PR 20609 to protect vm_pager_page_unswapped().
The remainder of the PR is still open.

PR: kern/20609 (partial fix)


68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


68261 02-Nov-2000 tegge

Clear the MAP_ENTRY_USER_WIRED flag from cloned vm_map entries.
PR: 2840


67885 29-Oct-2000 phk

Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing
the offending inline function (BUF_KERNPROC) on it being #included
already.

I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).

Remove consequently unneeded #includes of <sys/proc.h>


67536 25-Oct-2000 jhb

- Catch a machine/mutex.h -> sys/mutex.h I somehow missed.
- Close a small race condition. The sched_lock mutex protects
p->p_stat as well as the run queues. Another CPU could change p_stat
of the process while we are waiting for the lock, and we would end up
scheduling a process that isn't runnable.


67247 17-Oct-2000 ps

Implement write combining for crashdumps. This is useful when
write caching is disabled on both SCSI and IDE disks where large
memory dumps could take up to an hour to complete.

Taking an i386 scsi based system with 512MB of ram and timing (in
seconds) how long it took to complete a dump, the following results
were obtained:

Before:                    After:
WCE  TIME                  WCE  TIME
------------------         ------------------
1    141.820972            1    15.600111
0    797.265072            0    65.480465

Obtained from: Yahoo!
Reviewed by: peter


67082 13-Oct-2000 dillon

The swap bitmap allocator was not calculating the bitmap size properly
in the face of non-stripe-aligned swap areas. The bug could cause a
panic during boot.

Refuse to configure a swap area that is too large (67 GB or so)

Properly document the power-of-2 requirement for SWB_NPAGES.

The patch is slightly different than the one Tor enclosed in the P.R.,
but accomplishes the same thing.

PR: kern/20273
Submitted by: Tor.Egge@fast.no


67046 12-Oct-2000 jasone

For lockmgr mutex protection, use an array of mutexes that are allocated
and initialized during boot. This avoids bloating sizeof(struct lock).
As a side effect, it is no longer necessary to enforce the assumption that
lockinit()/lockdestroy() calls are paired, so the LK_VALID flag has been
removed.

Idea taken from: BSD/OS.


66748 06-Oct-2000 dwmalone

If a process is over its resource limit for datasize, still allow
it to lower its memory usage. This was mentioned on the mailing
lists ages ago, and I've lost the name of the person who brought
it up.
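
A minimal sketch of the rule, assuming the check compares sizes rather
than addresses: check_break() is a hypothetical helper, not the actual
obreak() code, and the use of ENOMEM for the failure is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Decide whether changing the data size from old_size to new_size is
 * allowed under limit. Growing past the limit fails; shrinking always
 * succeeds, even when the process is already over the limit.
 */
static int
check_break(size_t old_size, size_t new_size, size_t limit)
{
    if (new_size > old_size && new_size > limit)
        return ENOMEM;
    return 0;
}
```

The key point is that the limit is only consulted when the request
grows the data segment, so an over-limit process can still give memory
back.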

Reviewed by: alc


66615 04-Oct-2000 jasone

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


65904 15-Sep-2000 jhb

- Add a new process flag P_NOLOAD that marks a process that should be
ignored during load average calculations.
- Set this flag for the idle processes and the softinterrupt process.


65770 12-Sep-2000 bp

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


65557 07-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


65103 26-Aug-2000 obrien

Make the arguments match the functionality of the functions.


63973 28-Jul-2000 peter

Minor cleanups:
- remove unused variables (fix warnings)
- use a more consistent ansi style rather than a mixture
- remove dead #if 0 code and declarations


63897 26-Jul-2000 mckusick

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timeliness
are not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


62941 11-Jul-2000 alfred

#elsif -> #elif

Noticed by: green


62622 05-Jul-2000 jhb

Support for unsigned integer and long sysctl variables. Update the
SYSCTL_LONG macro to be consistent with other integer sysctl variables
and require an initial value instead of assuming 0. Update several
sysctl variables to use the unsigned types.

PR: 15251
Submitted by: Kelly Yancey <kbyanc@posi.net>


62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


62568 04-Jul-2000 jhb

Replace the PQ_*CACHE options with a single PQ_CACHESIZE option that you
set equal to the number of kilobytes in your cache. The old options are
still supported for backwards compatibility.

Submitted by: Kelly Yancey <kbyanc@posi.net>


62552 04-Jul-2000 mckusick

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grok our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


62067 25-Jun-2000 markm

Nifty idea from Jeroen van Gelderen; don't call a routine to check if
we are using the /dev/zero device, just check a flag (supplied by
/dev/zero).
Reviewed by: dfr


61272 05-Jun-2000 hsu

Add missing increment of allocation counter.


61081 29-May-2000 dillon

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTITIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather than swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


61074 29-May-2000 dfr

Brucify the pmap_enter_temporary() changes.


61058 29-May-2000 dillon

Fix bug in vm_pageout_page_stats() that always resulted in a full
scan of the active queue. This fix is not expected to have any
noticeable impact on performance.

Noticed by: Rik van Riel <riel@conectiva.com.br>


61036 28-May-2000 dfr

Add a new pmap entry point, pmap_enter_temporary() to be used during
dumps to create temporary page mappings. This replaces the use of CADDR1
which is fairly x86 specific.

Reviewed by: dillon


60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


60757 21-May-2000 peter

Checkpoint of a new physical memory backed object type, that does not
have pv_entries. This is intended for very special circumstances,
eg: a certain database that has a 1GB shm segment mapped into 300
processes. That would consume 2GB of kvm just to hold the pv_entries
alone. This would not be used on systems unless the physical ram was
available, as it's not pageable.

This is a work-in-progress, but is a useful and functional checkpoint.
Matt has got some more fixes for it that will be committed soon.

Reviewed by: dillon
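
The KVM estimate above can be reproduced with simple arithmetic. The sketch
below is only illustrative: the 4 KB page size and the 28-byte pv_entry size
are assumptions, not figures stated in the commit, and pv_entry_bytes() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of pv_entry storage needed when a segment
 * of seg_bytes is mapped by nprocs processes, one entry per page per
 * mapping.
 */
static uint64_t
pv_entry_bytes(uint64_t seg_bytes, uint64_t page_size, uint64_t nprocs,
    uint64_t pv_size)
{
	assert(page_size != 0);
	return ((seg_bytes / page_size) * nprocs * pv_size);
}
```

Under those assumptions, a 1 GB segment mapped by 300 processes needs
262144 * 300 * 28 = 2202009600 bytes, which is roughly the 2GB quoted above.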


60755 21-May-2000 peter

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


60557 14-May-2000 dillon

Fixed bug in madvise() / MADV_WILLNEED. When the request is offset
from the base of the first map_entry the call to pmap_object_init_pt()
uses the wrong start VA. MFC to follow.

PR: i386/18095


60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bde's teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


59915 03-May-2000 phk

Convert the vm_pager_strategy() interface to take a struct bio instead of
a struct buf. Don't try to examine B_ASYNC, it is a layering violation
to do so. The only current user of this interface is vn(4) which, since
it emulates a disk interface, operates on struct bio already.


59866 01-May-2000 phk

Move and staticize the bufchain functions so they become local to the
only piece of code using them. This will ease a rewrite of them.


59794 30-Apr-2000 phk

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


59496 22-Apr-2000 wollman

Implement POSIX.1b shared memory objects. In this implementation,
shared memory objects are regular files; the shm_open(3) routine
uses fcntl(2) to set a flag on the descriptor which tells mmap(2)
to automatically apply MAP_NOSYNC.

Not objected to by: bde, dillon, dufault, jasone


59395 19-Apr-2000 alc

vm_object_shadow: Remove an incorrect assertion. In obscure circumstances
vm_object_shadow can be called on an object with ref_count > 1 and
OBJ_ONEMAPPING set. This isn't really a problem for vm_object_shadow.


59368 18-Apr-2000 phk

Remove unneeded <sys/buf.h> includes.

Due to some interesting cpp tricks in lockmgr, the LINT kernel shrinks
by 924 bytes.


59249 15-Apr-2000 phk

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


59017 04-Apr-2000 msmith

Fix _zget() so that it checks the return from kmem_alloc(), to avoid
attempting to bzero NULL when the kernel map fills up. _zget() will
now return NULL as it seems it was originally intended to do.
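
The pattern being fixed is the usual "check the allocation before zeroing
it". A userland sketch of the same idea, using malloc(3) and memset(3) in
place of kmem_alloc() and bzero(); alloc_zeroed() is a hypothetical name and
this is not the _zget() code itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return zeroed memory, or NULL when the
 * allocation fails, instead of zeroing through a NULL pointer.
 */
static void *
alloc_zeroed(size_t n)
{
	void *p;

	assert(n > 0);
	p = malloc(n);
	if (p == NULL)
		return (NULL);	/* propagate the failure to the caller */
	memset(p, 0, n);
	return (p);
}
```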


58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


58708 27-Mar-2000 dillon

Add necessary spl protection for swapper. The problem was located by
Alfred while testing his SPLASSERT stuff. This is not a complete fix,
more protections are probably needed.


58705 27-Mar-2000 charnier

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


58634 26-Mar-2000 charnier

Spelling


58462 22-Mar-2000 phk

Fix one place which knew that B_WRITE was zero.

Fix a stylistic mistake of mine while here.

Found by: Stephen Hocking <shocking@prth.pgs.com>


58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.
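
Why a zero-valued flag invites mistakes, and what "exactly one bit set"
buys, can be shown in a few lines. The constants below are illustrative
stand-ins, not the values from the kernel headers, and all function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; they mirror the problem, not the headers. */
#define	OLD_READ	0x1u
#define	OLD_WRITE	0x0u	/* a "flag" with no bits cannot be tested */

enum demo_iocmd { DEMO_READ = 0x1, DEMO_WRITE = 0x2, DEMO_FREEBUF = 0x4 };

/* The obvious-looking test is always false when the flag is zero. */
static bool
old_is_write(unsigned flags)
{
	return ((flags & OLD_WRITE) != 0);
}

/* True when cmd has exactly one bit set. */
static bool
exactly_one_bit(unsigned cmd)
{
	return (cmd != 0 && (cmd & (cmd - 1)) == 0);
}

/* With a command field, the direction is an equality test. */
static bool
new_is_write(unsigned cmd)
{
	assert(exactly_one_bit(cmd));
	return (cmd == DEMO_WRITE);
}
```

An uninitialized (zero) command field fails the one-bit check immediately,
instead of being silently treated as a write.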


58132 16-Mar-2000 phk

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


57975 13-Mar-2000 phk

Remove unused 3rd argument from vsunlock() which abused B_WRITE.


57550 28-Feb-2000 ps

Add MAP_NOCORE to mmap(2), and MADV_NOCORE and MADV_CORE to madvise(2).
This feature allows you to specify if mmap'd data is included in
an application's corefile.

Change the type of eflags in struct vm_map_entry from u_char to
vm_eflags_t (an unsigned int).

Reviewed by: dillon,jdp,alfred
Approved by: jkh


57263 16-Feb-2000 dillon

Fix null-pointer dereference crash when the system is intentionally
run out of KVM through a mmap()/fork() bomb that allocates hundreds
of thousands of vm_map_entry structures.

Add panic to make null-pointer dereference crash a little more verbose.

Add a new sysctl, vm.max_proc_mmap, which specifies the maximum number
of mmap()'d spaces (discrete vm_map_entry's in the process). The value
defaults to around 9000 for a 128MB machine. The test is scaled for the
number of processes sharing a vmspace (aka linux threads). Setting
the value to 0 disables the feature.

PR: kern/16573
Approved by: jkh


56599 25-Jan-2000 dillon

The swapdev_vp changes made to rip out the swap specfs interaction
also broke diskless swapping. Moving the swapdev_vp initialization
to more commonly run code solves the problem.

PR: kern/16165
Additional testing by: David Gilbert <dgilbert@velocet.ca>


56378 21-Jan-2000 dillon

Fix a deadlock between msync(..., MS_INVALIDATE) and vm_fault. The
invalidation code cannot wait for paging to complete while holding a
vnode lock, so we don't wait. Instead we simply allow the lower level
code to simply block on any busy pages it encounters. I think Yahoo
may be the only entity in the entire world that actually uses this
msync feature :-).

Bug reported by: Paul Saab <paul@mu.org>


55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


55351 03-Jan-2000 guido

Use MAP_NOSYNC for vnodes without any links in their filesystem.

This is necessary for vmware: it does not use an anonymous mmap for
the memory of the virtual system. Instead it creates a temp file and
unlinks it. For a 50 MB file, this results in a lot of syncing
every 30 seconds.

Reviewed by: Matthew Dillon <dillon@backplane.com>


55206 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSDs who made this change quite some time ago. More commits to come.


55175 28-Dec-1999 peter

Fix the swap backed vn case - this was broken by my rev 1.128 to
swap_pager.c and related commits.

Essentially swap_pager.c is backed out to before the changes, but
swapdev_vp is converted into a real vnode with just VOP_STRATEGY().
It no longer abuses specfs vnops and no longer needs a dev_t and
/dev/drum (or /dev/swapdev) for the intermediate layer.

This essentially restores the vnode interface as the interface to the
bottom of the swap pager, and vm_swap.c provides a clean vnode interface.

This will need to be revisited when we swap to files (vnodes) - which
is the other reason for keeping the vnode interface between the swap pager
and the swap devices.

OK'ed by: dillon


54655 15-Dec-1999 eivind

Introduce NDFREE (and remove VOP_ABORTOP)


54467 12-Dec-1999 dillon

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg
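
A userland sketch of the use case described above: a file-backed shared
mapping used purely as shared memory. shared_scratch() is a hypothetical
helper and the temporary backing file is an assumption made for
illustration; where MAP_NOSYNC does not exist the flag is defined away so
the sketch still compiles.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define	MAP_NOSYNC	0	/* not available here: plain MAP_SHARED */
#endif

/*
 * Hypothetical helper: map len bytes of an anonymous temporary file
 * as shared memory, asking the system not to flush dirty pages
 * gratuitously.  Returns NULL on failure.
 */
static void *
shared_scratch(size_t len)
{
	FILE *f;
	void *p;

	assert(len > 0);
	if ((f = tmpfile()) == NULL)
		return (NULL);
	if (ftruncate(fileno(f), (off_t)len) != 0) {
		fclose(f);
		return (NULL);
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fileno(f), 0);
	fclose(f);	/* the mapping keeps the file contents alive */
	return (p == MAP_FAILED ? NULL : p);
}
```

Processes that inherit or otherwise share the mapping see each other's
stores as usual; only the periodic write-back of dirty pages is affected.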


54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() get a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


54188 06-Dec-1999 luoqi

User ldt sharing.


53899 29-Nov-1999 phk

Report swapdevices as cdevs rather than bdevs.

Remove unused dev2budev() function.


53701 25-Nov-1999 alc

Remove nonsensical vm_map_{clear,set}_recursive() calls
from vm_map_pageable(). At the point they are called, vm_map_pageable()
holds a read (or shared) lock on the map. The purpose
of vm_map_{clear,set}_recursive() is to disable/enable repeated
write (or exclusive) lock requests by the same process.


53627 23-Nov-1999 alc

Correct the following error: vm_map_pageable() on a COW'ed (post-fork)
vm_map always failed because vm_map_lookup() looked at
"vm_map_entry->wired_count" instead of "(vm_map_entry->eflags &
MAP_ENTRY_USER_WIRED)". The effect was that many page
wiring operations by sysctl were (silently) failing.


53594 22-Nov-1999 phk

Isolate the swapdev_vp "not quite" vnode in the only source file which
needs it now that /dev/drum is gone.

Reviewed by: eivind, peter


53338 18-Nov-1999 peter

Remove the non-functional "swap device" userland front-end to the
multiplexed underlying swap devices (/dev/drum). The only thing it did
was to allow root to open /dev/drum, but not do anything with it.
Various utilities used to grovel around in here, but Matt has written
a much nicer (and clean) front-end to this for libkvm, and nothing uses
the old system any more.

The VM system was calling VOP_STRATEGY() on the vp of the first underlying
swap device (not the /dev/drum one, the first real device), and using
the VOP system to indirectly (and only) call swstrategy() to choose
an underlying device and enqueue it on that device. I have changed it
to avoid diverting through the VOP system and to call the only possible
target directly, saving a little bit of time and some complexity.

In all, nothing much changes, except some scaffolding to support the
roundabout way of calling swstrategy() is gone.

Matt gave me the ok to do this some time ago, and I apologize for taking
so long to get around to it.


53074 10-Nov-1999 alc

Two changes: (1) Use vm_page_unqueue_nowakeup in vm_page_alloc
instead of duplicating the code. (2) If a wired page is passed
to vm_page_free_toq, panic instead of printing a friendly warning.
(If we don't panic here, we'll just panic later in vm_page_unwire
obscuring the problem.)


52974 08-Nov-1999 alc

Remove unused declarations.


52973 07-Nov-1999 alc

Remove unused #include's.

Submitted by: phk


52960 07-Nov-1999 alc

The functions declared by this header file no longer exist.

Submitted by: phk (in part)


52649 30-Oct-1999 alc

Reverse the sense of the test in the KASSERT's from the last commit.


52647 30-Oct-1999 alc

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


52644 30-Oct-1999 phk

Change useracc() and kernacc() to use VM_PROT_{READ|WRITE|EXECUTE} for the
"rw" argument, rather than hijacking B_{READ|WRITE}.

Fix two bugs (physio & cam) resulting from the confusion caused by this.

Submitted by: Tor.Egge@fast.no
Reviewed by: alc, ken (partly)


52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering on the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


52617 29-Oct-1999 alc

Remove the last vestiges of "vm_map_t phys_map". It's been unused
since i386/i386/machdep.c rev 1.45 (or 1994 :-) ).


52568 27-Oct-1999 alc

Shrink "struct vm_object" by not spending a full 32 bits
on "objtype_t".


52035 08-Oct-1999 phk

Fix a panic(8) implementation:
hexdump -C < /dev/drum
by simply refusing to do I/O from userland.
a panic. I'm not sure we even need /dev/drum anymore, it seems
to have been broken for a long time thi


51930 04-Oct-1999 phk

Introduce swopen to prevent blockdevice opens and insist on minor==0.


51928 04-Oct-1999 phk

Give the swap device a D_DISK flag against my better judgement.

TODO: add an open routine which fails for bdev opens.


51810 30-Sep-1999 dt

Plug an accounting leak: count pages in ZONE_INTERRUPT zones as wired.


51658 25-Sep-1999 phk

Remove five now unused fields from struct cdevsw. They should never
have been there in the first place. A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags


51493 21-Sep-1999 dillon

cleanup madvise code, add a few more sanity checks.

Reviewed by: Alan Cox <alc@cs.rice.edu>, dg@root.com


51488 21-Sep-1999 dillon

Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.

This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


51474 20-Sep-1999 dillon

Fix bug in pipe code relating to writes of mmap'd but illegal address
spaces which cross a segment boundary in the page table. pmap_kextract()
is not designed for access to the user space portion of the page
table and cannot handle the null-page-directory-entry case.

The fix is to have vm_fault_quick() return a success or failure which
is then used to avoid calling pmap_kextract().


51343 17-Sep-1999 dillon

Remove inappropriate VOP_FSYNC from vm_object_page_clean(). The fsync
syncs the entire underlying file rather than just the requested range,
resulting in huge inefficiencies when the VM system is articulated in
a certain way. The VOP_FSYNC was also found to massively reduce NFS
performance in certain cases.

Change MADV_DONTNEED and MADV_FREE to call vm_page_dontneed() instead
of vm_page_deactivate(). Using vm_page_deactivate() causes all
inactive and cache pages to be recycled before the dontneed/free page
is recycled, effectively flushing our entire VM inactive & cache
queues continuously even if only a few pages are being actively MADV
free'd and reused (such as occurs with a sequential scan of a
memory-mapped file).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


51342 17-Sep-1999 dillon

Add 'lastr' field to vm_map_entry in preparation for its removal
from the vnode. (The changeover is undergoing final testing and
will be committed soon).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


51340 17-Sep-1999 dillon

The vnode pager (used when you do file-backed mmaps) must use the
underlying physical sector size when aligning I/O transfer sizes.
It cannot assume 512 bytes.

We assume the underlying sector size is a power of 2. If it isn't,
mmap() will break badly anyway (in the same way mmap broke with NFS
when NFS tried to cache piecemeal write ranges in buffers, before
we enforced read-buffer-before-write-piecemeal for NFS).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
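
The power-of-2 assumption is what makes sector alignment a mask operation.
A minimal sketch, not the vnode pager code; round_up_to_sector() is a
hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round nbytes up to a multiple of secsize.
 * Only valid when secsize is a power of 2, which is asserted.
 */
static size_t
round_up_to_sector(size_t nbytes, size_t secsize)
{
	assert(secsize != 0 && (secsize & (secsize - 1)) == 0);
	return ((nbytes + secsize - 1) & ~(secsize - 1));
}
```

A 1000-byte transfer rounds to 1024 bytes on a 512-byte-sector device but
to 2048 bytes on a 2048-byte-sector device, which is why a hard-coded 512
gives wrong transfer sizes.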


51339 17-Sep-1999 dillon

Fix a number of spl bugs related to reserving and freeing swap space.
Swap space can be freed from an interrupt and so swap reservation and
freeing must occur at splvm.

Add swap_pager_reserve() code to support a new swap pre-reservation
capability for the VN device.

Generally cleanup the swap code by simplifying the swp_pager_meta_build()
static function and consolidating the SWAPBLK_NONE test from a bit test
to an absolute compare. The bit test was left over from a rejected
swap allocation scheme that was not ultimately committed. A few other
minor cleanups were also made.

Reorganize the swap strategy code, again for VN support, to not
reallocate swap when writing as this messes up pre-reservation and
can fragment I/O unnecessarily as VN-based disk is messed around with.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


51338 17-Sep-1999 dillon

Add required BUF_KERNPROC to flushchainbuf() to disassociate the
current process from the exclusive lock prior to initiating I/O.

This fixes a panic related to swap-backed VN disks

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


51337 17-Sep-1999 dillon

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>

Replace various VM related page count calculations strewn over the
VM code with inlines to aid in readability and to reduce fragility
in the code where modules depend on the same test being performed
to properly sleep and wakeup.

Split out a portion of the page deactivation code into an inline
in vm_page.c to support vm_page_dontneed().

Add vm_page_dontneed(), which handles the madvise MADV_DONTNEED
feature in a related commit coming up for vm_map.c/vm_object.c. This
code prevents degenerate cases where an essentially active page may
be rotated through a subset of the paging lists, resulting in premature
disposal.


50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50405 26-Aug-1999 phk

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a preexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED.


50301 24-Aug-1999 green

When the SYSINIT() was removed, it was replaced with a make_dev on-demand
creation of /dev/drum via calling swapon. However, the make_dev has a
bogus (insofar that it hasn't been added yet) cdevsw, so later we end
up crashing with a null pointer dereference on the swap vp's specinfo.
The specinfo points to a dev_t with a major of 254 (uninitialized), and
we get a crash on its d_strategy being called.

The simple solution to this is to call cdevsw_add before the make_dev
is ever used. This fixes the panic which occurred upon swapping.


50269 23-Aug-1999 bde

Use devtoname to print dev_t's instead of casting them to u_long for
misprinting with %lx.

Cast pointers to intptr_t instead of casting them to long. Cosmetic.


50254 23-Aug-1999 phk

Convert DEVFS hooks in (most) drivers to make_dev().

Diskslice/label code not yet handled.

Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)

Add the correct hook for devfs to kern_conf.c

The net result of this exercise is that a lot fewer files depend on DEVFS,
and devtoname() gets more sensible output in many cases.

A few drivers had minor additional cleanups performed relating to cdevsw
registration.

A few drivers don't register a cdevsw{} anymore, but only use make_dev().


50248 23-Aug-1999 alc

Correct the inconsistent formatting in struct vm_map.

Addendum to rev 1.47: submitted by dillon.


50247 23-Aug-1999 alc

struct vm_map:
The lock structure cannot be the first element of the vm_map
because this can result in livelock between two or more system
processes trying to kmem_alloc_wait.


50136 22-Aug-1999 alc

Remove two unused variable declarations.


50075 20-Aug-1999 alc

vm_page_alloc and contigmalloc1:
Verify that free pages are not dirty.

Submitted by: dillon


50034 19-Aug-1999 peter

Update for run queue code.


49998 18-Aug-1999 mjacob

Fix breakage - an extra brace got inserted where DIAGNOSTIC was defined
but MAP_LOCK_DIAGNOSTIC wasn't.


49991 17-Aug-1999 green

Unbreak the nfs KLD_MODULE. It needs a bit more of vm_page.h than was
exported (notably vm_page_undirty()). Also, let vm_page_dirty() work
in a KLD.


49979 17-Aug-1999 alc

vm_page_free_toq:
Update the comment to reflect the demise of PQ_ZERO and
remove a (now) useless test.


49949 17-Aug-1999 alc

Correct an accidental omission of one "vm_page_undirty" replacement
from the previous commit.


49948 17-Aug-1999 alc

vm_page_free_toq:
Clear the dirty bit mask (vm_page_undirty) before adding the page
to the free page queue.

Submitted by: dillon


49945 17-Aug-1999 alc

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


49937 17-Aug-1999 alc

vm_pageout_clean:
Remove dead code.

Submitted by: dillon


49900 16-Aug-1999 alc

vm_map_lock*:
Remove semicolons or add "do { } while (0)" as necessary
to enable the use of these macros in arbitrary statements.
(There are no functional changes.)

Submitted by: dillon
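
The "do { } while (0)" idiom mentioned above makes a multi-statement macro
behave like a single statement. The macros below are hypothetical stand-ins
that operate on a plain counter, not the vm_map_lock* macros themselves.

```c
#include <assert.h>

/*
 * Hypothetical lock-depth macros.  Without the do/while wrapper, the
 * ';' after DEMO_LOCK(...) in an if/else would terminate the "if" and
 * leave the "else" dangling, which is a syntax error.
 */
#define	DEMO_LOCK(depthp) do {						\
	assert(*(depthp) >= 0);						\
	(*(depthp))++;							\
} while (0)

#define	DEMO_UNLOCK(depthp) do {					\
	assert(*(depthp) > 0);						\
	(*(depthp))--;							\
} while (0)

/* The macro is usable as the sole statement of an if branch. */
static int
lock_if(int cond, int *depthp)
{
	if (cond)
		DEMO_LOCK(depthp);
	else
		return (0);
	return (1);
}
```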


49858 15-Aug-1999 alc

Remove the declarations for "vm_map_t io_map". It's been unused
since i386/i386/machdep rev 1.310, i.e., the demise of BOUNCE_BUFFERS.


49852 15-Aug-1999 alc

Remove the declarations for "vm_map_t u_map". It's been unused
since i386/i386/pmap rev 1.190. (The alpha never used it.)


49819 15-Aug-1999 alc

contigmalloc1 (currently) depends on PQ_FREE and PQ_CACHE not being 0
to tell a valid "struct vm_page" from an invalid one in the vm_page_array.
This isn't a very robust method.


49813 15-Aug-1999 mjacob

Add back in old definitions if we're compiling for alpha.


49720 14-Aug-1999 alc

Don't create a "struct vpgqueues" for PQ_NONE.


49697 13-Aug-1999 alc

vm_map_madvise:
A complete rewrite by dillon and myself to separate
the implementation of behaviors that affect the vm_map_entry
from those that affect the vm_object.

A result of this change is that madvise(..., MADV_FREE);
is much cheaper.


49679 13-Aug-1999 phk

The bdevsw() and cdevsw() are now identical, so kill the former.


49666 12-Aug-1999 alc

Make the default page coloring parameters match a (non-Xeon) Pentium II/III.

This setting is also acceptable for Celerons and Pentium Pros
with less than 1MB L2 caches.

Note: PQ_L2_SIZE is a misnomer. The correct number of colors is
a function of the cache's degree of associativity as well as its size.

Submitted by: bde and alc
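
The note about associativity follows from the usual relation
colors = cache size / (associativity * page size). A sketch of that
arithmetic; page_colors() is a hypothetical helper, and the 512 KB, 4-way,
4 KB-page figures in the usage note are assumptions, not values stated in
the commit.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of page colors for a physically indexed
 * cache of cache_bytes with the given associativity and page size.
 */
static unsigned
page_colors(unsigned cache_bytes, unsigned ways, unsigned page_bytes)
{
	assert(ways != 0 && page_bytes != 0);
	return (cache_bytes / ways / page_bytes);
}
```

For an assumed 512 KB, 4-way cache with 4 KB pages this gives 32 colors; a
256 KB cache with the same associativity gives 16, so cache size alone does
not determine the count.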


49655 12-Aug-1999 alc

vm_object_madvise:
Update the comments to match the implementation.

Submitted by: dillon


49654 12-Aug-1999 alc

vm_object_madvise:
Support MADV_DONTNEED and MADV_WILLNEED on object types
besides OBJT_DEFAULT and OBJT_SWAP.

Submitted by: dillon


49618 11-Aug-1999 alc

contigmalloc1:
If a page is found in the wrong queue, panic instead
of silently ignoring the problem.


49615 10-Aug-1999 peter

Add a contigfree() as a corollary to contigmalloc() as it's not clear
which free routine to use and people are tempted to use free() (which
doesn't work)


49592 10-Aug-1999 alc

vm_map_madvise:
Now that behaviors are stored in the vm_map_entry rather than
the vm_object, it's no longer necessary to instantiate a vm_object
just to hold the behavior.

Reviewed by: dillon


49558 09-Aug-1999 phk

Merge the cons.c and cons.h to the best of my ability. alpha may or
may not compile, I can't test it.


49535 08-Aug-1999 phk

Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


49338 01-Aug-1999 alc

Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by: dillon


49326 31-Jul-1999 alc

Change the type of vpgqueues::lcnt from "int *" to "int". The indirection
served no purpose.


49305 31-Jul-1999 alc

vm_page_queue_init:
Remove the initialization of PQ_NONE's cnt and lcnt. They aren't
used.

vm_page_insert:
Remove an unnecessary dereference.

vm_page_wire:
Remove the one and only (and thus pointless) reference
to PQ_NONE's lcnt.


48974 22-Jul-1999 alc

Reduce the number of "magic constants" used for page coloring
by one: PQ_PRIME2 and PQ_PRIME3 are used to accomplish the same
thing at different places in the kernel. Drop PQ_PRIME3.


48963 21-Jul-1999 alc

Fix the following problem:

When creating new processes (or performing exec), the new page
directory is initialized too early. The kernel might grow before
p_vmspace is initialized for the new process. Since pmap_growkernel
doesn't yet know about the new page directory, it isn't updated, and
subsequent use causes a failure.

The fix is (1) to clear p_vmspace early, to stop pmap_growkernel
from stomping on memory, and (2) to defer part of the initialization
of new page directories until p_vmspace is initialized.

PR: kern/12378
Submitted by: tegge
Reviewed by: dfr


48948 20-Jul-1999 green

Make a dev2budev() function, and use it. This refixes pstat (working, broken,
working, broken, working) and savecore (working, working, broken, working,
working).

Sorta Reviewed by: phk


48922 20-Jul-1999 alc

Convert a "page not busy" warning to an assertion.

Submitted by: dillon@backplane.com


48866 17-Jul-1999 phk

Add a field to struct swdevt to avoid a bogus udev2dev() call.


48859 17-Jul-1999 phk

I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name. s/umakedev/makeudev/g


48833 16-Jul-1999 alc

Remove vm_object::last_read. It is used by the old swap pager, but
not by the new one, i.e., vm/swap_pager.c rev 1.108.

Reviewed by: dillon@backplane.com


48757 11-Jul-1999 alc

Cleanup OBJ_ONEMAPPING management.

vm_map.c:
Don't set OBJ_ONEMAPPING on arbitrary vm objects. Only default
and swap type vm objects should have it set. vm_object_deallocate
already handles these cases.

vm_object.c:
If OBJ_ONEMAPPING isn't already clear in vm_object_shadow,
we are in trouble. Instead of clearing it, make it
an assertion that it is already clear.


48738 10-Jul-1999 alc

Change the data type used to represent page color in the vm_object
to be the same as that used in the vm_page. (This change also
shrinks the vm_object.)


48736 10-Jul-1999 alc

Remove unused function prototypes.


48658 07-Jul-1999 ache

Add an unused argument to udev2dev() so that the kernel compiles.


48652 07-Jul-1999 msmith

Reinstate the previous fix for the broken export of a dev_t in sw_dev, convert
back to a dev_t when the value is actually used.


48651 07-Jul-1999 green

Back out previous commit. It was wrong, and caused panics.


48647 06-Jul-1999 msmith

swdevt should contain a udev_t, not a dev_t. This resulted in bogus
swap device name reporting.

Submitted by: Bill Swingle <unfurl@freebsd.org>


48590 05-Jul-1999 mckay

Reformat previous fix to remove an uglier than average goto.

Looked OK to: dg


48544 04-Jul-1999 mckusick

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truly empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occurred when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


48409 01-Jul-1999 peter

Fix some int/long printf problems for the Alpha


48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


48289 27-Jun-1999 peter

Kirk missed a required BUF_KERNPROC(). Even though this is a non-async
transfer, the b_iodone hook causes biodone() to release it from interrupt
context.


48274 27-Jun-1999 peter

Minor tweaks to make sure (new) prerequisites for <sys/buf.h> (mostly
splbio()/splx()) are #included in time.


48252 26-Jun-1999 peter

There isn't much point waking up a daemon that hasn't existed since
softupdates came in. Try calling speedup_syncer() instead..


48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


48099 22-Jun-1999 alc

Remove (1) "extern" declarations for variables that were previously
made "static" and (2) initialized but unused variables.


48059 20-Jun-1999 alc

Remove vm_object::cache_count and vm_object::wired_count. They are
not used. (Nor is there any planned use by John who introduced them.)

Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>


48045 20-Jun-1999 alc

Set cnt.v_page_size to PAGE_SIZE rather than DEFAULT_PAGE_SIZE so that
"vmstat -s" reports the correct value on the Alpha.

Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>


48022 19-Jun-1999 alc

Remove some unused function and variable declarations.


47986 17-Jun-1999 alc

vm_map_growstack uses vmspace::vm_ssize as though it contained
the stack size in bytes when in fact it is the stack size in pages.
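
The unit mix-up described above can be illustrated with explicit conversions.
This is only a sketch: the 4KB page size and the helper names are assumptions
made for the example, not the kernel's own macros.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this example */

/* Convert a count of pages (the unit of vm_ssize) to bytes. */
static size_t
sketch_pages_to_bytes(size_t pages)
{
	return (pages * SKETCH_PAGE_SIZE);
}

/* Convert a byte count to pages, rounding up to a whole page. */
static size_t
sketch_bytes_to_pages(size_t bytes)
{
	return ((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}
```

Comparing a page count directly against a byte limit is off by a factor of
the page size, which is the class of error this commit corrects.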


47968 17-Jun-1999 alc

vm_map_insert sometimes extends an existing vm_map entry, rather than
creating a new entry. vm_map_stack and vm_map_growstack can panic when
a new entry isn't created. Fixed vm_map_stack and vm_map_growstack.

Also, when extending the stack, always set the protection to VM_PROT_ALL.


47966 17-Jun-1999 alc

Move vm_map_stack and vm_map_growstack after the definition
of the vm_map_clip_end macro. (The next commit will modify
vm_map_stack and vm_map_growstack to use vm_map_clip_end.)


47965 17-Jun-1999 alc

Remove some unused declarations and duplicate initialization.


47888 12-Jun-1999 alc

vm_map_protect:
The wrong vm_map_entry is used to determine if writes must not be
allowed due to COW.


47841 08-Jun-1999 dt

Add a function kmem_alloc_nofault() - same as kmem_alloc_pageable(), but
create a nofault entry. It will be used to allocate kmem for upages.

(I am not too happy with all this, but it's better than nothing).


47765 05-Jun-1999 alc

vm_mmap:
Ensure that device mappings get MAP_PREFAULT(_PARTIAL) set,
so that 4M page mappings are used when possible.

Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>


47673 01-Jun-1999 phk

Shorten a detour around dev_t to get a udev_t created.


47640 31-May-1999 phk

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print a message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


47625 30-May-1999 phk

This commit should be an extensive NO-OP:

Reformat and initialize correctly all "struct cdevsw".

Initialize the d_maj and d_bmaj fields.

The d_reset field was not removed, although it is never used.

I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.

Vinum and i4b not modified, patches emailed to respective authors.


47607 30-May-1999 alc

Addendum to 1.155. Verify the existence of the object before checking
its reference count.


47568 28-May-1999 alc

Avoid the creation of unnecessary shadow objects.


47290 18-May-1999 alc

vm_map_insert:
General cleanup. Eliminate coalescing checks that are duplicated
by vm_object_coalesce.


47258 17-May-1999 alc

Add the options MAP_PREFAULT and MAP_PREFAULT_PARTIAL to vm_map_find/insert,
eliminating the need for the pmap_object_init_pt calls in imgact_* and
mmap.

Reviewed by: David Greenman <dg@root.com>


47243 16-May-1999 alc

Remove prototypes for functions that don't exist anymore (vm_map.h).

Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).

Remove a redundant test in vm_uiomove (vm_map.c).

Make two changes to vm_object_coalesce:

1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)

2. Free any swap space allocated to removed pages.
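
Change 1 above amounts to a cheap range test before the expensive call. A
minimal sketch, with hypothetical names and page indices as plain integers
(the real code works on vm_object fields):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: the existing object holds pages at indices
 * [0, object_pages). The range being coalesced starts at new_start and is
 * new_pages long. Only when it begins inside the existing range can there
 * be stale pages that need to be removed.
 */
static bool
overlaps_existing_pages(size_t object_pages, size_t new_start,
    size_t new_pages)
{
	return (new_pages > 0 && new_start < object_pages);
}
```

A range that starts exactly at the end of the object does not overlap, so the
page-removal call can be skipped in that case.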


47239 15-May-1999 dt

Fix confusion of size of transfer with size of the pager.

PR: 11658
Broken in: 1.89 (1998/03/07)


47207 14-May-1999 alc

Simplify vm_map_find/insert's interface: remove the MAP_COPY_NEEDED option.

It never makes sense to specify MAP_COPY_NEEDED without also specifying
MAP_COPY_ON_WRITE, and vice versa. Thus, MAP_COPY_ON_WRITE suffices.

Reviewed by: David Greenman <dg@root.com>


47111 13-May-1999 bde

Casting handles from void * to uintptr_t on the way to dev_t became
especially bogus when dev_t became a pointer.


47094 13-May-1999 luoqi

Device pager's handle is dev_t not udev_t.


47064 12-May-1999 phk

Fix a udev_t/dev_t mismatch which prevented paging from working.


47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to an actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few housekeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I believe this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


46816 09-May-1999 phk

No point in swapdev being a static global when used only locally.


46676 08-May-1999 phk

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


46635 07-May-1999 phk

Continue where Julian left off in July 1998:

Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.

Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)

Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)


46625 07-May-1999 phk

Introduce two functions: physread() and physwrite() and use these directly
in *devsw[] rather than the 46 local copies of the same functions.

(grog will do the same for vinum when he has time)


46592 06-May-1999 peter

Add brackets to silence egcs and help clarity.


46580 06-May-1999 phk

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


46538 06-May-1999 luoqi

Don't ignore mmap() address hint below the text section.


46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


46349 02-May-1999 alc

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


46153 28-Apr-1999 dt

s/static foo_devsw_installed = 0;/static int foo_devsw_installed;/.
(Edited automatically)


46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


45960 23-Apr-1999 dt

Make pmap_collect() an official pmap interface.


45821 19-Apr-1999 peter

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


45665 13-Apr-1999 peter

Move the declaration of faultin() from the vm headers to proc.h, since
it is now referenced from a macro there (PHOLD()).


45567 11-Apr-1999 eivind

Staticize


45561 10-Apr-1999 dt

Convert usage of vm_page_bits() to the new convention ("Inputs are required
to range within a page").


45550 10-Apr-1999 eivind

Lock vnode correctly for VOP_OPEN.

Discussed with: alc, dillon


45365 06-Apr-1999 peter

Don't forcibly kill processes that are locked in-core via PHOLD - it was
just checking P_NOSWAP before.


45363 06-Apr-1999 peter

Only use p->p_lock (managed by PHOLD()/PRELE()) - P_NOSWAP/P_PHYSIO is no
longer set.


45347 05-Apr-1999 julian

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


45293 04-Apr-1999 alc

Two changes to vm_map_delete:

1. Don't bother checking object->ref_count == 1 in order to set
OBJ_ONEMAPPING. It's a waste of time. If object->ref_count == 1,
vm_map_entry_delete will "run-down" the object and its pages.

2. If object->ref_count == 1, ignore OBJ_ONEMAPPING. Wait for
vm_map_entry_delete to "run-down" the object and its pages.
Otherwise, we're calling two different procedures to delete
the object's pages.

Note: "vmstat -s" will once again show a non-zero value
for "pages freed by exiting processes".


45069 27-Mar-1999 alc

Mainly, eliminate the comments about share maps. (We don't have share maps
any more.) Also, eliminate an incorrect comment that says that we don't
coalesce vm_map_entry's. (We do.)


45057 27-Mar-1999 eivind

Correct a comment.


44928 21-Mar-1999 alc

Two changes:

Remove more (redundant) map timestamp increments from properly
synchronized routines. (Changed: vm_map_entry_link, vm_map_entry_unlink,
and vm_map_pageable.)

Micro-optimize vm_map_entry_link and vm_map_entry_unlink, eliminating
unnecessary dereferences. At the same time, converted them from macros
to inline functions.


44880 19-Mar-1999 alc

Construct the free queue(s) in descending order (by physical
address) so that the first 16MB of physical memory is allocated
last rather than first. On large-memory machines, this avoids
the exhaustion of low physical memory before isa_dmainit has run.


44793 16-Mar-1999 alc

Correct a problem in kmem_malloc: A kmem_malloc allowing "wait" may
block (VM_WAIT) holding the map lock. This is bad. For example, a subsequent
kmem_malloc by an interrupt handler on the same map may find the lock held
and panic in the lockmgr.


44773 15-Mar-1999 alc

Two changes:

In general, vm_map_simplify_entry should be performed INSIDE
the loop that traverses the map, not outside. (Changed:
vm_map_inherit, vm_map_pageable.)

vm_fault_unwire doesn't acquire the map lock (or block holding
it). Thus, vm_map_set/clear_recursive shouldn't be called.
(Changed: vm_map_user_pageable, vm_map_pageable.)


44771 15-Mar-1999 julian

Fix breakage in last commit
Submitted by: Brian Feldman <green@unixhelp.org>


44754 14-Mar-1999 julian

A bit of a hack, but allows the vn device to be a module again.

Submitted by: Matt Dillon <dillon@freebsd.org>


44739 14-Mar-1999 julian

Submitted by: Matt Dillon <dillon@freebsd.org>
The old VN device broke in -4.x when the definition of B_PAGING
changed. This patch fixes this plus implements additional capabilities.
The new VN device can be backed by a file ( as per normal ), or it can
be directly backed by swap.

Due to dependencies in VM include files (on opt_xxx options) the new
vn device cannot be a module yet. This will be fixed in a later commit.
This commit is delimited by tags {PRE,POST}_MATT_VNDEV


44733 14-Mar-1999 alc

Correct two optimization errors in vm_object_page_remove:

1. The size of vm_object::memq is vm_object::resident_page_count,
not vm_object::size.

2. The "size > 4" test sometimes results in the traversal of a ~1000 page
memq in order to locate ~10 pages.


44682 12-Mar-1999 alc

Remove vm_page_frees from kmem_malloc that are performed
by vm_map_delete/vm_object_page_remove anyway.


44675 12-Mar-1999 julian

Stop the mfs from trying to swap out crucial bits of the mfs
as this can lead to deadlock.
Submitted by: Matt Dillon <dillon@freebsd.org>


44597 09-Mar-1999 alc

Remove (redundant) map timestamp increments from some properly
synchronized routines.


44569 08-Mar-1999 alc

Remove an unused variable from vmspace_fork.


44565 07-Mar-1999 alc

Change vm_map_growstack to acquire and hold a read lock (instead of a write
lock) until it actually needs to modify the vm_map.

Note: it is legal to modify vm_map::hint without holding a write lock.

Submitted by: "Richard Seaman, Jr." <dick@tar.com> with minor changes
by myself.


44513 06-Mar-1999 alc

Upgrading a map's lock to exclusive status should increment
the map's timestamp. In general, whenever an exclusive lock is
acquired the timestamp should be incremented.


44438 02-Mar-1999 alc

To avoid a conflict for the vm_map's lock with vm_fault, release
the read lock around the subyte operations in mincore. After the lock is
reacquired, use the map's timestamp to determine if we need to restart
the scan.
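
The drop-the-lock-and-revalidate pattern described above can be sketched in a
single-threaded model. Everything here is hypothetical scaffolding for
illustration: the struct, the callbacks, and the comments standing in for the
lock calls are not the kernel's interfaces.

```c
#include <assert.h>

struct sketch_map {
	unsigned timestamp;	/* bumped on every exclusive lock acquisition */
};

/* Stand-in for another thread taking the exclusive lock on the map. */
static void
sketch_map_modify(struct sketch_map *m)
{
	m->timestamp++;
}

/*
 * Scan n entries. The per-entry work runs with the read lock dropped, so
 * the map may change underneath it. If the timestamp moved while unlocked,
 * the scan position can no longer be trusted and the scan restarts.
 * Returns the number of restarts.
 */
static int
scan_with_restart(struct sketch_map *m, int n,
    void (*unlocked_work)(struct sketch_map *, int))
{
	int restarts = 0;

restart:
	for (int i = 0; i < n; i++) {
		unsigned saved = m->timestamp;

		/* the read lock would be released here */
		unlocked_work(m, i);
		/* the read lock would be reacquired here */

		if (m->timestamp != saved) {
			restarts++;
			goto restart;
		}
	}
	return (restarts);
}

/* Sample callbacks: one never disturbs the map, one disturbs it once. */
static int disturb_left = 1;

static void
quiet_work(struct sketch_map *m, int i)
{
	(void)m;
	(void)i;
}

static void
disturbing_work(struct sketch_map *m, int i)
{
	if (i == 2 && disturb_left > 0) {
		disturb_left--;
		sketch_map_modify(m);
	}
}
```

With quiet_work the scan completes without restarting; with disturbing_work
it restarts exactly once, after the simulated modification at entry 2.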


44396 02-Mar-1999 alc

Remove the last of the share map code: struct vm_map::is_main_map.

Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>


44379 01-Mar-1999 alc

mincore doesn't modify the vm_map. Therefore, it doesn't require
an exclusive lock. A read lock will suffice.


44321 27-Feb-1999 alc

Reviewed by: "John S. Dyson" <dyson@iquest.net>
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
To prevent a deadlock, if we are extremely low on memory, force synchronous
operation by the VOP_PUTPAGES in vnode_pager_putpages.


44250 25-Feb-1999 alc

Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Corrected the computation of cnt.v_ozfod in vm_fault: vm_fault
was counting the number of unoptimized rather than optimized
zero-fill faults.


44249 25-Feb-1999 dillon

Comment swstrategy() routine.


44245 24-Feb-1999 dillon

Remove unnecessary page protects on map_split and collapse operations.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).

Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson


44206 22-Feb-1999 dillon

Removed ENOMEM error on swap_pager_full condition which ignored the
availability of physical memory. As per original bug report by
Bruce.

Reviewed by: Alan Cox <alc@cs.rice.edu>


44179 21-Feb-1999 dillon

Remove conditional sysctl's

Leave swap_async_max sysctl intact, remove swap_cluster_max sysctl.

Reviewed by: Alan Cox <alc@cs.rice.edu>


44178 21-Feb-1999 dillon

Reviewed by: Alan Cox <alc@cs.rice.edu>

Fix problem w/ low-swap/low-memory handling as reported by Bruce Evans.


44156 19-Feb-1999 luoqi

Eliminate a possible numerical overflow.


44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillon <dillon@apollo.backplane.com>


44135 19-Feb-1999 dillon

Submitted by: Alan Cox <alc@cs.rice.edu>

Remove remaining share map garbage from vm_map_lookup() and clean out
old #if 0 stuff.


44124 18-Feb-1999 dillon

Limit the number of simultaneous asynchronous swap pager I/Os that can
be in progress at any given moment.

Add two swap tunables to sysctl:

vm.swap_async_max: 4
vm.swap_cluster_max: 16

Recommended values are a cluster size of 8 or 16 pages. async_max is
about right for 1-4 swap devices. Reduce to 2 if swap is eating too much
bandwidth, or even 1 if swap is both eating too much bandwidth and sitting
on a slow network (10BaseT).

The defaults work well across a broad range of configurations and should
normally be left alone.


44098 17-Feb-1999 dillon

Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>

Unlock vnode before messing with map to avoid deadlock between map and
vnode ( e.g. with exec_map and underlying program binary vnode ). Solves
a deadlock that most often occurs during a large -j# buildworld reported
by three people.


44051 15-Feb-1999 dillon

Minor reorganization of vm_page_alloc(). No functional changes have
been made but the code has been reorganized and documented to make
it more readable, reduce the size of the code, and optimize the branch
path caching capabilities that most modern processors have.


44034 15-Feb-1999 dillon

Fix a bug in the new madvise() code that would possibly (improperly)
free swap space out from under a busy page. This is not legal because
the swap may be reallocated and I/O issued while I/O is still in
progress on the same swap page from the madvise()'d object. This bug
could only occur under extreme paging conditions but might not cause
an error until much later. As a side-benefit, madvise() is now even
smaller.


43941 12-Feb-1999 dillon

Minor optimization to madvise() MADV_FREE to make page as freeable as
possible without actually unmapping it from the process.

As of now, I declare madvise() on OBJT_DEFAULT/OBJT_SWAP objects to be
'working and complete'.


43923 12-Feb-1999 dillon

Fix non-fatal bug in vm_map_insert() which improperly cleared
OBJ_ONEMAPPING in the case where an object is extended but an
additional vm_map_entry must be allocated.

In vm_object_madvise(), remove call to vm_page_cache() in MADV_FREE
case in order to avoid a page fault on page reuse. However, we still
mark the page as clean and destroy any swap backing store.

Submitted by: Alan Cox <alc@cs.rice.edu>


43795 09-Feb-1999 dillon

Addendum to vm_map coalesce optimization. Also, this was backed-out
because there was a consensus on current in regards to leaving bss r+w+x
instead of r+w. This is in order to maintain reasonable compatibility
with existing JIT compilers (e.g. kaffe) and possibly other programs.


43777 08-Feb-1999 dillon

Revamp vm_object_[q]collapse(). Despite the complexity of this patch,
no major operational changes were made. The three core object->memq loops
were moved into a single inline procedure and various operational
characteristics of the collapse function were documented.


43761 08-Feb-1999 dillon

General cleanup. Remove #if 0's and remove useless register qualifiers.


43752 08-Feb-1999 dillon

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other than the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guaranteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.
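
The append/prepend scheme described above can be sketched in userland C
with the <sys/queue.h> TAILQ macros. This is only an illustration of the
idea, not the kernel code: struct page, free_page() and alloc_page() are
hypothetical names, and a single queue stands in for the per-color tailqs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for vm_page_t; only the fields the sketch needs. */
struct page {
	int zeroed;			/* plays the role of PG_ZERO */
	TAILQ_ENTRY(page) pageq;
};
TAILQ_HEAD(pglist, page);

/* Freeing: zeroed pages go to the tail, all other pages to the head. */
static void
free_page(struct pglist *q, struct page *m)
{
	assert(m != NULL);
	if (m->zeroed)
		TAILQ_INSERT_TAIL(q, m, pageq);
	else
		TAILQ_INSERT_HEAD(q, m, pageq);
}

/*
 * Allocating: a zero-page request looks at the tail, a normal request
 * at the head. The caller must still check m->zeroed, because the tail
 * page is not zeroed when the queue holds no zeroed pages at all.
 */
static struct page *
alloc_page(struct pglist *q, int want_zero)
{
	struct page *m;

	m = want_zero ? TAILQ_LAST(q, pglist) : TAILQ_FIRST(q);
	if (m != NULL)
		TAILQ_REMOVE(q, m, pageq);
	return (m);
}
```

Keeping both kinds of page on one queue means neither request type needs
a separate fallback list: when its preferred end runs dry it simply gets
whatever page is left.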

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.
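
A minimal sketch of the hysteresis, assuming the 1/3 and 1/2 thresholds
quoted above apply to the count of already-zeroed pages relative to the
free page count. The structure and function names are hypothetical and
the real vm_page_zero_idle() differs in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; not the kernel's actual variables. */
struct zero_idle_state {
	bool zeroing;	/* true while idle-time zeroing is switched on */
};

/*
 * Decide whether the idle loop should zero another free page.
 * Zeroing switches on once the zeroed count falls to 1/3 of the free
 * count and switches off once it reaches 1/2. Between the two marks
 * the previous decision is kept.
 */
static bool
zero_idle_should_run(struct zero_idle_state *st, unsigned free_count,
    unsigned zero_count)
{
	assert(zero_count <= free_count);
	if (st->zeroing) {
		if (zero_count >= free_count / 2)
			st->zeroing = false;
	} else {
		if (zero_count <= free_count / 3)
			st->zeroing = true;
	}
	return (st->zeroing);
}
```

With a single threshold, a workload that allocates and frees one page at
a time near that threshold would toggle zeroing on every call; the gap
between the two marks is what prevents that.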


43751 08-Feb-1999 dillon

Backed out vm_map coalesce optimization - it resulted in 22% more page
faults for reasons unknown ( under investigation ).
/usr/bin/time -l make in /usr/src/bin went from 67000 faults to 90000
faults.


43748 07-Feb-1999 dillon

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


43747 07-Feb-1999 dillon

Remove L1 cache coloring optimization ( leave L2 cache coloring opt ).

Rewrite vm_page_list_find() and vm_page_select_free() - make inline out
of nominal case.


43729 07-Feb-1999 dillon

When shadowing objects, adjust the page coloring of the shadowing object
such that pages in the combined/shadowed object are consistently
colored.

Submitted by: "John S. Dyson" <dyson@iquest.net>


43700 06-Feb-1999 dillon

Add hysteresis to the 'swap_pager_getswapspace; failed' console message.
Also widen the hysteresis levels a little ( these really should be
dynamically configured ).


43638 05-Feb-1999 dillon

The elf loader sets the permissions on bss to VM_PROT_READ|VM_PROT_WRITE
rather than VM_PROT_ALL. obreak, on the other hand, uses VM_PROT_ALL.
This prevents vm_map_insert() from being able to coalesce the heap and
creates an extra map entry. Since current architectures ignore
VM_PROT_EXECUTE anyway, and since not having VM_PROT_EXECUTE on data/bss
may provide protection in the future, obreak now uses read+write rather
than all (r+w+x).

This is an optimization, not a bug fix.

Submitted by: Alan Cox <alc@cs.rice.edu>


43616 04-Feb-1999 dillon

Fix bug in a KASSERT I introduced in vm_page_qcollapse() rev 1.139.

Since paging is in progress, page scan in vm_page_qcollapse() must be
protected at at least splbio() to prevent pages from being ripped out from
under the scan.


43547 03-Feb-1999 dillon

Submitted by: Alan Cox

The vm_map_insert()/vm_object_coalesce() optimization has been extended
to include OBJT_SWAP objects as well as OBJT_DEFAULT objects. This is
possible because it costs nothing to extend an OBJT_SWAP object with
the new swapper. We can't do this with the old swapper. The old swapper
used a linear array that would have had to have been reallocated, costing
time as well as a potential low-memory deadlock.


43493 01-Feb-1999 dillon

This patch eliminates a pointless test from appearing twice
in vm_map_simplify_entry. Basically, once you've verified that
the objects in the adjacent vm_map_entry's are the same, either
NULL or the same vm_object, there's no point in checking that the
objects have the same behavior.

Obtained from: Alan Cox <alc@cs.rice.edu>


43476 31-Jan-1999 julian

Submitted by: Alan Cox <alc@cs.rice.edu>
Checked by: "Richard Seaman, Jr." <dick@tar.com>
Fix the following problem:
As the code stands now, growing any stack, and not just the process's
main stack, modifies vm->vm_ssize. This is inconsistent with the code
earlier in the same procedure.


43311 28-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


43287 27-Jan-1999 dillon

Remove unintended trigraph sequences in comments for -Wall


43209 26-Jan-1999 julian

Mostly remove the VM_STACK OPTION.
This changes the definitions of a few items so that structures are the
same whether or not the option itself is enabled. This allows
people to enable and disable the option without recompiling the world.

As the author says:

|I ran into a problem pulling out the VM_STACK option. I was aware of this
|when I first did the work, but then forgot about it. The VM_STACK stuff
|has some code changes in the i386 branch. There need to be corresponding
|changes in the alpha branch before it can come out completely.

what is done:
|
|1) Pull the VM_STACK option out of the header files it appears in. This
|really shouldn't affect anything that executes with or without the rest
|of the VM_STACK patches. The vm_map_entry will then always have one
|extra element (avail_ssize). It just won't be used if the VM_STACK
|option is not turned on.
|
|I've also pulled the option out of vm_map.c. This shouldn't harm anything,
|since the routines that are enabled as a result are not called unless
|the VM_STACK option is enabled elsewhere.
|
|2) Add what appears to be appropriate code to the alpha branch, still
|protected behind the VM_STACK switch. I don't have an alpha machine,
|so we would need to get some testers with alpha machines to try it out.
|
|Once there is some testing, we can consider making the change permanent
|for both i386 and alpha.
|
[..]
|
|Once the alpha code is adequately tested, we can pull VM_STACK out
|everywhere.
|

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


43208 26-Jan-1999 julian

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


43145 24-Jan-1999 dillon

Undo last commit - not a bug, just duplicate code. PG_MAPPED and
PG_WRITEABLE are already cleared by vm_page_protect().


43138 24-Jan-1999 dillon

Change all manual settings of vm_page_t->dirty = VM_PAGE_BITS_ALL
to use the vm_page_dirty() inline.

The inline can thus do sanity checks ( or not ) over all cases.


43136 24-Jan-1999 dillon

vm_map_split() used to dirty the page manually after calling
vm_page_rename(), but never pulled the page off PQ_CACHE if it was on
PQ_CACHE. Dirty pages in PQ_CACHE are not allowed and a KASSERT was
added in -4.x to test for this... and got hit.

In -4.x, vm_page_rename() automatically dirties the page. This commit
also has it deal with the PQ_CACHE case, deactivating the page in that
case.


43134 24-Jan-1999 dillon

Add vm_page_dirty() inline with PQ_CACHE sanity check


43129 24-Jan-1999 dillon

vm_pager_put_pages() is passed an rcval array to hold per-page return
values. The 'int' return value for the procedure was never used and
not well defined in any case when there are mixed errors on pages, so
it has been removed. vm_pager_put_pages() and associated vm_pager
functions now return void.


43128 24-Jan-1999 dillon

Clear PG_MAPPED as well as PG_WRITEABLE when a page is moved to the
cache.


43127 24-Jan-1999 dillon

Added warning printf ( needs INVARIANTS ) when busy cache page is found
while trying to free memory.


43123 24-Jan-1999 dillon

It is possible for a page in the cache to be busy. vm_pageout.c was not
checking for this condition while it tried to free cache pages. Fixed.


43122 24-Jan-1999 dillon

Add invariants to vm_page_busy() and vm_page_wakeup() to check for
PG_BUSY stupidity.


43121 24-Jan-1999 dillon

Clear PG_WRITEABLE in vm_page_cache(). This may or may not be a bug,
but the bit should definitely be cleared.


43120 24-Jan-1999 dillon

Deprecate vm_object_pmap_copy() - nobody uses it. Everyone uses
vm_object_pmap_copy_1() now, apparently.


43119 24-Jan-1999 dillon

Get rid of unused old_m in vm_fault. Add INVARIANTS to test whether
page is still busy after all the hell vm_fault goes through.. it is
supposed to be, and printf() if it isn't. don't panic, though.


43086 23-Jan-1999 dillon

Reenable John Dyson's low-memory VM_WAIT code for page reactivations out
of PQ_CACHE. Add comments explaining what it accomplishes and its
limitations.


42979 21-Jan-1999 dillon

Mainly changes to support the new swapper. The big adjustment is that
swap blocks are now in PAGE_SIZE'd increments instead of DEV_BSIZE'd
increments. We still convert to DEV_BSIZE'd increments for the
backing store I/O, but everything else is in PAGE_SIZE increments.
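
As a rough illustration of the unit change, the conversion from a
page-sized swap block index to a device block number is a single
multiplication. The constants below are assumed example values (512-byte
device blocks, 4K pages) and swapblk_to_devblk() is a hypothetical
helper, not a function from the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed example values; real kernels take these from machine headers. */
#define DEV_BSIZE	512
#define PAGE_SIZE	4096

/* Convert a swap block index (PAGE_SIZE units) to DEV_BSIZE units. */
static int64_t
swapblk_to_devblk(int64_t swapblk)
{
	assert(swapblk >= 0);
	return (swapblk * (PAGE_SIZE / DEV_BSIZE));
}
```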


42978 21-Jan-1999 dillon

Move many of the vm_pager_*() functions from vm_pager.c to inlines in
vm_pager.h


42977 21-Jan-1999 dillon

Move many of the vm_pager_*() functions from vm_pager.c to inlines in
vm_pager.h

Added argument to getpbuf() and relpbuf() to allow each subsystem to
specify a different hard limit on the number of simultaneous physical
buffers that said subsystem may allocate. Without this feature, one
subsystem ( e.g. the vfs clustering code ) could hog *ALL* the pbufs,
causing a deadlock in the pager in a low memory situation.

Same for trypbuf().
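
The per-subsystem limit can be pictured as a caller-owned counter that
is checked alongside the shared pool. The sketch below is a simplified,
non-blocking model with hypothetical names (pbuf_pool, pbuf_try_get,
pbuf_release); the real getpbuf() sleeps instead of failing and hands
back a struct buf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared pool: just a count of free physical buffers. */
struct pbuf_pool {
	int nfree;
};

/*
 * Take one buffer. pfreecnt points at the caller's own quota counter,
 * or is NULL for "no private limit". Returns 1 on success, 0 when the
 * pool is empty or the caller's quota is used up.
 */
static int
pbuf_try_get(struct pbuf_pool *pool, int *pfreecnt)
{
	if (pool->nfree == 0)
		return (0);
	if (pfreecnt != NULL) {
		if (*pfreecnt == 0)
			return (0);
		--*pfreecnt;
	}
	--pool->nfree;
	return (1);
}

/* Give one buffer back and restore the caller's quota. */
static void
pbuf_release(struct pbuf_pool *pool, int *pfreecnt)
{
	assert(pool->nfree >= 0);
	++pool->nfree;
	if (pfreecnt != NULL)
		++*pfreecnt;
}
```

Because each subsystem can exhaust only its own counter, buffers remain
in the shared pool for the pager even when another subsystem is
saturated.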


42976 21-Jan-1999 dillon

Reorganized some of the low memory testing code to make it more useful.

Removed call to vm_object_collapse(), which can block. This was being
called without the pageout code holding any sort of reference on the
vm_object or vm_page_t structures being manipulated. Since this code
can block, it was possible for other kernel code to shred the state
the pageout code was assuming remained intact.

Fixed potential blocking condition in vm_pageout_page_free() ( which
could cause a deadlock in a low-memory situation ).

Currently there is a hack in-place to deal with clean filesystem meta-data
polluting the inactive page queue. John doesn't like the hack, and neither
do I.

Revamped and commented a portion of the pageout loop.

Added protection against potential memory deadlocks with OBJT_VNODE
when using VOP_ISLOCKED(). The problem is that vp->v_data can be NULL
which causes VOP_ISLOCKED() to return a less informed answer.

remove vm_pager_sync() -- none of the pagers use it any more ( the old
swapper used to. The new one does not ).


42975 21-Jan-1999 dillon

The TAILQ hashq has been turned into a singly-linked list link,
reducing the size of vm_page_t.

SWAPBLK_NONE and SWAPBLK_MASK are defined here. These actually are
more generalized than their names imply, but their placement is somewhat
of a legacy issue from a prior test version of this code that put
the swapblk in the vm_page_t structure. That test code was eventually
thrown away. The legacy remains.

Added vm_page_flash() inline. Similar to vm_page_wakeup() except that
it does not clear PG_BUSY ( one assumes that PG_BUSY is already clear ).
Used by a number of routines to wakeup waiters.

Collapsed some of the code in inline calls to make other inline calls.
GCC will optimize this well and it reduces duplication.

vm_page_free() and vm_page_free_zero() inlines added to convert to
the proper vm_page_free_toq() call.

vm_page_sleep_busy() inline added, replacing vm_page_sleep() ( which has
been removed ). This implements a much more optimizable page-waiting
function.


42974 21-Jan-1999 dillon

The hash table used to be a table of doubly-linked list headers ( two
pointers per entry ). The table has been changed to a singly linked
list of vm_page_t pointers. The table has been doubled in size, but
the entries only take half the space so a net-zero change in memory use.

The hash function has been changed, hopefully for the better. The
combination of the larger hash table size and changed function should
keep the chain length down to a reasonable number (0-3, average 1).

vm_object->page_hint has been removed. This 'optimization' was not
only never needed, but costs as much as a hash chain link to implement.
While having page_hint in vm_object might result in better locality
of reference, the cost is not worth the space in vm_object or the
extra instructions in my view.

vm_page_alloc*() functions have been inlined and call a generalized
non-inlined vm_page_alloc_toq() which combines the standard alloc
and zero-page alloc functions together, reducing code size and the L1
cache footprint. Some reordering has been done... not much. The
delinking code should be faster ( because unlinking a doubly-linked list
requires four memory ops and unlinking a singly linked list only requires
two ), and we get a hash consistency check for free.
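
A small sketch of a singly-linked hash chain with the "free"
consistency check mentioned above. The table size, the hash function and
all names here are hypothetical; the point is only that removal has to
walk the chain, so a page that is missing from its bucket is detected as
a side effect.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SIZE	8	/* assumed power of two, for masking */

/* Hypothetical page with a single forward hash link. */
struct page {
	struct page *hnext;
	unsigned long pindex;
};

static struct page *hash_tab[HASH_SIZE];

static struct page **
hash_bucket(unsigned long pindex)
{
	return (&hash_tab[pindex & (HASH_SIZE - 1)]);
}

static void
hash_insert(struct page *m)
{
	struct page **bucket = hash_bucket(m->pindex);

	assert(m->hnext == NULL);
	m->hnext = *bucket;
	*bucket = m;
}

static struct page *
hash_lookup(unsigned long pindex)
{
	struct page *m;

	for (m = *hash_bucket(pindex); m != NULL; m = m->hnext)
		if (m->pindex == pindex)
			return (m);
	return (NULL);
}

/*
 * Unlink m from its chain. Returns 1 on success and 0 if m was not
 * found, which is the consistency check: a real kernel would panic.
 */
static int
hash_remove(struct page *m)
{
	struct page **pp;

	for (pp = hash_bucket(m->pindex); *pp != NULL; pp = &(*pp)->hnext) {
		if (*pp == m) {
			*pp = m->hnext;
			m->hnext = NULL;
			return (1);
		}
	}
	return (0);
}
```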

vm_page_rename() now automatically sets the page's dirty bits.

vm_page_alloc() does not try to manually inline freeing a cache page.
Instead, it now properly calls vm_page_free(m) ... vm_page_free() is
really too complex to manually inline.

vm_await(), supporting asleep(), has been added.


42973 21-Jan-1999 dillon

The vm_object structure is now somewhat smaller due to the removal
of most of the swap-pager-specific fields, the removal of the id,
and the removal of paging_offset.

A new inline, vm_object_pip_wakeupn() has been added to subtract an
arbitrary number n from the paging_in_progress count and then wakeup
waiters as necessary. n may be 0, resulting in a 'flash'.
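
The behaviour described above, including the n == 0 'flash' case, can be
modelled as follows. This is a userland sketch with hypothetical field
names; a counter stands in for the actual wakeup() call.

```c
#include <assert.h>

/* Hypothetical object: only the fields the sketch needs. */
struct object {
	int paging_in_progress;
	int pip_wanted;		/* someone is sleeping on the count */
	int wakeups;		/* stand-in for calling wakeup() */
};

/*
 * Subtract n from paging_in_progress and wake waiters once the count
 * reaches zero. With n == 0 nothing is subtracted, but waiters are
 * still woken if the count is already zero (the 'flash').
 */
static void
object_pip_wakeupn(struct object *obj, int n)
{
	assert(n >= 0 && n <= obj->paging_in_progress);
	obj->paging_in_progress -= n;
	if (obj->pip_wanted && obj->paging_in_progress == 0) {
		obj->pip_wanted = 0;
		obj->wakeups++;
	}
}
```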


42972 21-Jan-1999 dillon

object->id was badly implemented. It has simply been removed.

object->paging_offset has been removed - it was used to optimize a
single OBJT_SWAP collapse case yet introduced massive confusion throughout
vm_object.c. The optimization was inconsequential except for the
claim that it didn't have to allocate any memory. The optimization
has been removed.

madvise() has been fixed. The old madvise() could be made to operate
on shared objects which is a big no-no. The new one is much more careful
in what it modifies. MADV_FREE was totally broken and has now been fixed.

vm_page_rename() now automatically dirties a page, so explicit dirtying
of the page prior to calling vm_page_rename() has been removed.


42971 21-Jan-1999 dillon

Objects associated with raw devices are no longer counted in the VM stats
total because they may contain absurd numbers ( like the size of all
of physical memory if you mmap() /dev/mem ).


42970 21-Jan-1999 dillon

General cleanup related to the new pager. We no longer have to worry
about conversions of objects to OBJT_SWAP, it is done automatically
now.

Replaced manually inserted code with inline calls for busy waiting on
pages, which also incidentally fixes a potential PG_BUSY race due to
the code not running at splvm().

vm_objects no longer have a paging_offset field ( see vm/vm_object.c )


42969 21-Jan-1999 dillon

Potential bug fix, do not just clear PG_BUSY... call vm_page_wakeup()
instead to properly handle any waiters.

Added comments, added support for M_ASLEEP. Generally treat M_ flags
as flags instead of constants to compare against.


42968 21-Jan-1999 dillon

Removed low-memory blockages at fork. This is the wrong place to put
this sort of test. We need to fix the low-memory handling in general.


42967 21-Jan-1999 dillon

Mainly cleanup. Removed some inappropriate low-memory handling code
and added lots of comments. Add tie-in to vm_pager ( and thus the
new swapper ) to deallocate backing swap for dirtied pages on the fly.


42966 21-Jan-1999 dillon

The default_pager's interaction with the swap_pager has been reorganized,
and the swap_pager has been completely replaced.

The new swap pager uses the new blist radix-tree based bitmap allocator
for low level swap allocation and deallocation. The new allocator
is effectively O(5) while the old one was O(N), and the new allocator
allocates all required memory at init time rather than allocating
memory on the fly at run time.

Swap metadata is allocated in clusters and stored in a hash table,
eliminating linearly allocated structures.

Many, many features have been rewritten or added. Swap space is now
reallocated on the fly providing a poor man's auto defragmentation of
swap space. Swap space that is no longer needed is freed on a timely
basis so no garbage collection is necessary.

Swap I/O is marked B_ASYNC and NFS has been fixed to do the right
thing with it, so NFS-based paging now has around 10x the performance
as it did before ( previously NFS enforced synchronous I/O for paging ).


42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


42453 10-Jan-1999 eivind

KNFize, by bde.


42408 08-Jan-1999 eivind

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.
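
The double parentheses in the KASSERT() form above let a printf-style
argument list pass through a two-argument macro. A userland sketch,
assuming a hypothetical panic_stub() in place of the kernel's panic()
and an invented page_index() caller:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define INVARIANTS 1	/* assumed on for this sketch */

/* Userland stand-in for panic(): print the message and abort. */
static void
panic_stub(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	abort();
}

/* msg is a parenthesized argument list, pasted after the function name. */
#ifdef INVARIANTS
#define KASSERT(exp, msg) do { if (!(exp)) panic_stub msg; } while (0)
#else
#define KASSERT(exp, msg) do { } while (0)
#endif

#define PAGE_SHIFT 12	/* assumed 4K pages */

/* Hypothetical caller: check the argument, then compute a page index. */
static long
page_index(long offset)
{
	KASSERT(offset >= 0, ("page_index: negative offset %ld", offset));
	return (offset >> PAGE_SHIFT);
}
```

When INVARIANTS is not defined the macro expands to nothing, so the
check costs nothing in a production build.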

Reviewed by: msmith


42379 07-Jan-1999 julian

Changes to the LINUX_THREADS support to only allocate extra memory for
shared signal handling when there is shared signal handling being
used.

This removes the main objection to making the shared signal handling
a standard ability in rfork() and friends and 'unconditionalising'
this code. (i.e. the allocation of an extra 328 bytes per process).

Signal handling information remains in the U area until such a time as
its reference count would be incremented to > 1. At that point a new
struct is malloc'd and maintained in KVM so that it can be shared between
the processes (threads) using it.

A function to check the reference count and move the struct back to the U
area when it drops back to 1 is also supplied. Signal information is
therefore now swappable for all processes that are not sharing that
information with other processes. This should address the concerns raised
by Garrett and others.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


42360 06-Jan-1999 julian

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


42248 02-Jan-1999 bde

Ifdefed conditionally used simplock variables.


42153 29-Dec-1998 dt

Don't free swap in swap_pager_getpages(): this code probably causes the
"dying daemons" problem. (I thought this code was introduced in rev.1.80,
but it just relaxed the condition.)

Also, kill related "suggest more swap space" warning (also introduced in
1.80). It was confusing, to say the least...

Requested by: msmith
Not objected by: dg


42026 23-Dec-1998 dillon

Update comments to routines in vm_page.c, most especially whether a
routine can block or not as part of a general effort to carefully
document blocking/non-blocking calls in the kernel.


41936 19-Dec-1998 julian

Fix two bogons created by 'patch(1)' in my last commit.


41931 19-Dec-1998 julian

Reviewed by: Luoqi Chen, Jordan Hubbard
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)

Code to allow Linux Threads to run under FreeBSD.

By default not enabled
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garrett)
This is not yet a 'real' option but will be within some number of hours.


41620 09-Dec-1998 dt

Don't disable mmap with large file offset.


41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


41514 04-Dec-1998 archie

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for other cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.
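
A short example of the kind of replacement described above: snprintf()
with sizeof() on the destination instead of sprintf() with an assumed
size. The helper format_devname() is hypothetical and not taken from the
changed files.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "<prefix><unit>" into buf without writing more than len
 * bytes. Returns 1 if the whole name fit, 0 if it was truncated.
 * Either way buf is NUL-terminated.
 */
static int
format_devname(char *buf, size_t len, const char *prefix, int unit)
{
	int n;

	assert(len > 0);
	n = snprintf(buf, len, "%s%d", prefix, unit);
	return (n >= 0 && (size_t)n < len);
}
```

A caller would pass sizeof(buf) for len rather than a literal such as
16, so the bound follows the declaration if the array is later resized.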

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>


41503 04-Dec-1998 rvb

In vnode_pager_input_old, set auio.uio_procp = curproc
vs auio.uio_procp = (struct proc *) 0


41322 25-Nov-1998 dg

Add missing splvm protection around unqueue call. Without this, the page
queues would eventually get corrupted.


41250 19-Nov-1998 bde

Fixed a null pointer panic in spc_free(). swap_pager_putpages()
almost always causes this panic for the curproc != pageproc case.
This case apparently doesn't happen in normal operation, but it
happens when vm_page_alloc_contig() is called when there is a memory
hogging application that hasn't already been paged out.

PR: 8632
Reviewed by: info@opensound.com (Dev Mazumdar), dg
Broken in: rev.1.89 (1998/02/23)


41093 11-Nov-1998 dg

Closed a small race condition between wiring/unwiring pages that involved
the page's wire_count.


41059 10-Nov-1998 peter

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


41004 08-Nov-1998 dfr

* Fix a couple of places in the device pager where an address was
truncated to 32 bits.
* Change the calling convention of the device mmap entry point to
pass a vm_offset_t instead of an int for the offset allowing
devices with a larger memory map than (1<<32) to be supported
on the alpha (/dev/mem is one such).

These changes are required to allow the X server to mmap the various
I/O regions used for device port and memory access on the alpha.


40931 05-Nov-1998 dg

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.


40794 31-Oct-1998 peter

Add John Dyson's SYSCTL descriptions, and an export of more stats to
a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present
in source, they do not get compiled into the binaries taking up memory.


40790 31-Oct-1998 peter

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


40701 28-Oct-1998 dg

Fixed wrong comments in and about vm_page_deactivate().


40700 28-Oct-1998 dg

Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.


40673 27-Oct-1998 dg

Added needed splvm() protection around object page traversal in
vm_object_terminate().


40650 25-Oct-1998 bde

Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted
when bdevsw[] became sparse. We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.

Removed a redundant `major(dev) < nblkdev' test instead of updating it.

Don't follow a garbage bdevsw pointer for attempts to swap on empty
regular files. This case currently can't happen. Swapping on regular
files is ifdefed out in swapon() and isn't attempted for empty files
in nfs_mountroot().


40648 25-Oct-1998 phk

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


40605 23-Oct-1998 dg

Oops, revert part of last fix. vm_pager_dealloc() can't be called until
after the pages are removed from the object...so fix the problem by
not printing the diagnostic for wired fictitious pages (which is normal).


40604 23-Oct-1998 dg

Fixed two bugs in recent commit: in vm_object_terminate, vm_pager_dealloc
needs to be called prior to freeing remaining pages in the object so that
the device pager has an opportunity to grab its "fake" pages. Also, in
the case of wired pages, the page must be made busy prior to calling
vm_page_remove. This is a difference from 2.2.x that I overlooked when
I brought these changes forward.


40560 22-Oct-1998 dg

Make the VM system handle the case where a terminating object contains
legitimately wired pages. Currently we print a diagnostic when this
happens, but this will be removed soon when it will be common for this
to occur with zero-copy TCP/IP buffers.


40558 22-Oct-1998 dg

Convert fake page allocs to use the zone allocator, thus eliminating the
private pool management code in here.


40557 21-Oct-1998 dg

Set m->object to NULL in dev_pager_getfake().


40548 21-Oct-1998 dg

Nuked PG_TABLED flag. Replaced with m->object != NULL.


40546 21-Oct-1998 dg

Add a diagnostic printf for freeing a wired page. This will eventually
be turned into a panic, but I want to make sure that all cases of freeing
pages with wire_count==1 (which is/was allowed) have first been fixed.


40286 13-Oct-1998 dg

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.
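
The second class of bug can be reproduced in a few lines. The sketch
below assumes the bogus cast narrowed its argument to 32 bits; the
function names are hypothetical, and the real round_page() is a macro.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096	/* assumed 4K pages */
#define PAGE_MASK	(PAGE_SIZE - 1)

/* Buggy form: the cast discards the upper 32 bits before rounding. */
static uint64_t
round_page_truncating(uint64_t x)
{
	return (((uint32_t)x + PAGE_MASK) & ~PAGE_MASK);
}

/* Fixed form: all arithmetic stays in the 64-bit operand type. */
static uint64_t
round_page_wide(uint64_t x)
{
	assert(x <= UINT64_MAX - PAGE_MASK);
	return ((x + PAGE_MASK) & ~(uint64_t)PAGE_MASK);
}
```

For an offset just past 4GB, such as 0x100000001, the truncating version
returns 0x1000 while the wide version returns 0x100001000. Below 4GB the
two agree, which is why a bug of this kind can stay hidden for a long
time.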


40087 09-Oct-1998 jdp

Fix a panic on SMP systems, caused by sleeping while holding a
simple-lock.

The reviewer raises the following caveat: "I believe these changes
open a non-critical race condition when adding memory to the pool
for the zone. I think what will happen is that you could have two
threads that are simultaneously adding additional memory when the
pool runs out. This appears to not be a problem, however, since
the re-acquisition of the lock will protect the list pointers."
The submitter agrees that the race is non-critical, and points out
that it already existed for the non-SMP case. He suggests that
perhaps a sleep lock (using the lock manager) should be used to
close that race. This might be worth revisiting after 3.0 is
released.

Reviewed by: dg (David Greenman)
Submitted by: tegge (Tor Egge)


39873 01-Oct-1998 jdp

Fix a bug in which a page index was used where a byte offset was
expected. This bug caused builds of Modula-3 to fail in mysterious
ways on SMP kernels. More precisely, such builds failed on systems
with kern.fast_vfork equal to 0, the default and only supported
value for SMP kernels.

PR: kern/7468
Submitted by: tegge (Tor Egge)


39770 29-Sep-1998 abial

Make #define NO_SWAPPING a normal kernel config option.

Reviewed by: jkh


39739 28-Sep-1998 rvb

John Dyson approved of this solution; make vnode_pager_input_old set m->valid


39700 28-Sep-1998 dg

Be more selective about when we clear p->valid.
Submitted by: John Dyson <toor@dyson.iquest.net>


39512 20-Sep-1998 bde

Removed unused file.


38866 05-Sep-1998 bde

Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.


38799 04-Sep-1998 dfr

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


38729 01-Sep-1998 wollman

Separate wakeup conditions for page I/O count (pg_busy) and lock (PG_BUSY).
This is not a complete solution to the deadlock, but the additional wakeups
have helped in my observation.

Suggested by: John Dyson


38542 25-Aug-1998 luoqi

Fix a rounding problem that causes vnode pager to fail to remove the last
partially filled page during a truncation.

PR: kern/7422


38517 24-Aug-1998 dfr

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.

Reviewed by: Bruce Evans <bde@zeta.org.au>


38479 22-Aug-1998 mckay

Correct/clarify some comments.


38298 13-Aug-1998 dfr

Protect all modifications to paging_in_progress with splvm().


38135 06-Aug-1998 dfr

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptible) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)


37918 28-Jul-1998 bde

Fixed two spl nesting bugs. They caused (at least) the entire pageout
daemon to run at splvm() forever after swap_pager_putpages() is called
from vm_pageout_scan().

Broken in: rev.1.189 (1998/02/23)


37874 26-Jul-1998 dfr

Notify pmap when a page is freed on the alpha to allow it to clean up
its emulated modified/referenced bits.


37843 22-Jul-1998 dg

Improved pager input failure message.


37821 22-Jul-1998 phk

There is a comment in vm_param.h which doesn't belong to the
code still left in there. The macros it describes disappeared
sometime since 4.4BSD lite.

PR: 7246
Reviewed by: phk
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>


37653 15-Jul-1998 bde

Cast pointers to [u]intptr_t instead of to [unsigned] long.


37649 15-Jul-1998 bde

Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively. Most of the longs should probably have been
u_longs, but this change is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
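
For illustration, the portable form of such a cast looks like this. The
helpers are hypothetical; the point is that uintptr_t is defined to hold
any object pointer, whereas long is only pointer-sized on some
platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Derive a small bucket number from a pointer's address bits. */
static unsigned
ptr_bucket(const void *p)
{
	return ((unsigned)(((uintptr_t)p >> 4) & 0xff));
}

/* Converting to uintptr_t and back yields the original pointer. */
static const void *
ptr_round_trip(const void *p)
{
	uintptr_t bits = (uintptr_t)p;
	const void *q = (const void *)bits;

	assert(q == p);
	return (q);
}
```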


37641 14-Jul-1998 bde

Print pointers using %p instead of attempting to print them by
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).


37640 14-Jul-1998 bde

Print pointers using %p instead of attempting to print them by
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).

Use slightly less bogus casts for passing pointers to ddb command
functions.


37563 11-Jul-1998 bde

Fixed printf format errors.


37562 11-Jul-1998 bde

Fixed printf format errors.


37555 11-Jul-1998 bde

Fixed printf format errors.


37546 10-Jul-1998 alex

Removed no longer valid comment about swb_block being int instead of
daddr_t.

PR: 7238
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>


37545 10-Jul-1998 alex

Removed unnecessary test from if/else construct.

PR: 7233
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>


37395 05-Jul-1998 dfr

Don't truncate the return value of mmap to sizeof(int).


37389 04-Jul-1998 julian

There is no such thing any more as "struct bdevsw".

There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw and
cdevsw entries. The bdevsw[] table still exists and is a second pointer
to the cdevsw entry of the device. Its major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).

rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.

cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.

Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.


37384 04-Jul-1998 julian

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


37282 30-Jun-1998 jmg

document some VM paging options for cache sizes:
PQ_NOOPT no coloring
PQ_LARGECACHE used for 512k/16k cache
PQ_HUGECACHE used for 1024k/16k cache


37153 25-Jun-1998 phk

Remove bdevsw_add(), change the only two users to use bdevsw_add_generic().
Extend cdevsw to be superset of bdevsw.
Remove non-functional bdev lkm support.
Teach wcd what the open() args mean.


37101 21-Jun-1998 bde

Removed unused includes.


37094 21-Jun-1998 bde

Removed unused includes.


36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
in line with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


36677 05-Jun-1998 dg

Changed the log() of "Out of mbuf clusters - increase maxusers" to a
printf() of "Out of mbuf clusters - adjust NMBCLUSTERS or increase
maxusers" so that the message is more informative and so that it will
appear in the kernel message buffer.


36583 02-Jun-1998 dyson

Cleanup and remove some dead code from the initialization.


36582 02-Jun-1998 dyson

Correct sleep priority.


36326 24-May-1998 dyson

Support a 16K first level cache for 512K 2nd level. Also, add support
for 1MB 2nd level cache.


36275 21-May-1998 dyson

Make flushing dirty pages work correctly on filesystems that
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)


36177 19-May-1998 peter

Make the previous commit compile..


36164 18-May-1998 guido

Plug hole reported on Bugtraq: do not allow mmap with WRITE privs for
append-only and immutable files.

Obtained from: OpenBSD (partly)


36112 16-May-1998 dyson

An important fix for proper inheritance of backing objects for
object splits. Another excellent detective job by Tor.
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


35694 04-May-1998 dyson

Fix the shm panic. I mistakenly used the shadow_count to keep the object
from being split, and instead added an OBJ_NOSPLIT.


35669 04-May-1998 dyson

Work around some VM bugs, the worst being an overly aggressive
swap space free calculation. More complete fixes will be forthcoming,
in a week.


35615 02-May-1998 dyson

Another minor cleanup of the split code. Make sure that pages are
busied during the entire time, so that the waits for pages being
unbusy don't make the objects inconsistent.


35612 02-May-1998 peter

Seatbelts for vm_page_bits() in case a file offset is passed in rather than
the page offset. If a large file offset was passed in, a large negative
array index could be generated which could cause page faults etc at worst
and file corruption at the least. (Pages are allocated within file
space on page alignment boundaries, so a file offset being passed in here
is harmless to DTRT. The case where this was happening has already been
fixed though, this is in case it happens again).

Reviewed by: dyson
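
The seatbelt idea in this entry can be sketched as follows. This is not the committed vm_page_bits() code: the name sketch_page_bits and the SK_ constants are assumptions (4096-byte pages, 512-byte blocks, one valid bit per block), and the sketch computes the mask arithmetically instead of indexing a table. The point it illustrates is the reduction of a possible file offset to an offset within the page before any index is derived from it.

```c
#include <assert.h>

#define SK_PAGE_SIZE	4096	/* assumed page size */
#define SK_DEV_BSIZE	512	/* assumed block size */
#define SK_BITS_ALL	0xffu	/* 4096 / 512 = 8 valid bits per page */

/*
 * Hypothetical sketch: return the mask of per-block bits covered by
 * [offset, offset + size) within one page.
 */
static unsigned
sketch_page_bits(long offset, int size)
{
	int base, first, last;

	assert(offset >= 0 && size >= 0);

	/* Seatbelt: a file offset is reduced to an offset within the page. */
	base = (int)(offset % SK_PAGE_SIZE);

	/* Never describe bytes past the end of the page. */
	if (size > SK_PAGE_SIZE - base)
		size = SK_PAGE_SIZE - base;
	if (size == 0)
		return (0);

	first = base / SK_DEV_BSIZE;
	last = (base + size - 1) / SK_DEV_BSIZE;

	/* Set bits first..last inclusive. */
	return (((2u << last) - (1u << first)) & SK_BITS_ALL);
}
```

With the reduction in place, a page offset of 1024 and a file offset of 10 pages plus 1024 give the same mask, which matches the entry's remark that a file offset passed here is handled harmlessly.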


35571 01-May-1998 dyson

Fix minor bug with new over used swap fix.


35499 29-Apr-1998 dyson

Add a needed prototype, and fix a panic problem with the new
memory code.


35497 29-Apr-1998 dyson

Tighten up management of memory and swap space during map allocation,
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.


35485 28-Apr-1998 dyson

Fix a pseudo-swap leak problem. This mitigates "leaks" due to
freeing partial objects, not freeing entire objects didn't
free any of it. Simple fix to the map code.
Reviewed by: dg


35447 25-Apr-1998 dyson

Correct copyright.


35210 15-Apr-1998 bde

Support compiling with `gcc -ansi'.


34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't an atomic variable, so splfoo() protection was needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now use the new variable time_second instead.

gettime() changed to getmicrotime().

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeared time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passed &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


34611 16-Mar-1998 dyson

Some VM improvements, including elimination of a lot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, make page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


34525 12-Mar-1998 guido

Fix for mmap of char devices bug as described in OpenBSD advisory of
1998/02/20
Reviewed by: John Dyson
Submitted by: "Cy Schubert" <cschuber@uumail.gov.bc.ca>


34403 09-Mar-1998 msmith

Complement diagnostic messages about missing per-FS VOP page operations,
but don't make their absence fatal.
Submitted by: terry


34321 08-Mar-1998 dyson

Quell unneeded pageout daemon activity.


34320 08-Mar-1998 dyson

Remove a very ill advised vm_page_protect. This was being called
for a non-managed page. That is a big no-no.


34236 08-Mar-1998 dyson

Some cruft left over from my megacommit. A page rotation optimization
was a good idea, but can cause instability. That optimization is
now removed.


34235 08-Mar-1998 dyson

Several minor fixes:
1) When freeing pages, it is a good idea to protect them off.
(This is probably gratuitous, but good form.)
2) Allow collapsing pages in the backing object that are
PQ_CACHE. This will improve memory utilization.
3) Correct the collapse code so that pages that were on the
cache queue are moved to the inactive queue. This is
done when pages are marked dirty (so that those pages
will be properly paged out instead of freed), so that
cached pages will not be paradoxically marked dirty.


34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


34202 07-Mar-1998 dyson

Make vm_fault much cleaner by removing the evil macro inlines, and
put a lot of its context into a data structure. This allows
significant shortening of its codepath, and will significantly
decrease its cache footprint.

Also, add some stats to vmmeter. Note that you'll have to
rebuild/recompile vmstat, systat, etc... Otherwise, you'll
get "very interesting" paging stats.


34030 04-Mar-1998 dufault

Reviewed by: msmith, bde long ago
POSIX.4 headers and sysctl variables. Nothing should change
unless POSIX4 is defined or _POSIX_VERSION is set to 199309.


33936 01-Mar-1998 dyson

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappiness."


33847 26-Feb-1998 msmith

In the author's words:

These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown
for local media FS's.

See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation
details for generic *_{get|put}pages for local media FS's. Support
is trivial to add for any FS that formerly relied on the default
behaviour of the vnode_pager in EOPNOTSUPP cases (just copy the
ffs_getpages() code for the FS in question's *_{get|put}pages).

Obviously, it would be better if each local media FS implemented a
more optimal method, instead of calling an exported interface from
the /sys/vm/vnode_pager.c, but this is a necessary first step in
getting the FS's to a point where they can be supplied with better
implementations on a case-by-case basis.

Obviously, the cd9660_putpages() can be rather trivial (since it
is a read-only FS type 8-)).

A slight (temporary) modification is made to print a diagnostic message
in the case where the underlying filesystem attempts to engage in the
previous behaviour. Failure is likely to be ungraceful.

Submitted by: terry@freebsd.org (Terry Lambert)


33817 25-Feb-1998 dyson

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


33784 24-Feb-1998 dyson

Correct some severe VM tuning problems for small systems (<=16MB), and
improve tuning on larger systems. (A couple of the VM tuning params for
small systems were so badly chosen that the system could hang under load.)

The broken tuning was originally my fault.


33758 23-Feb-1998 dyson

Significantly improve the efficiency of the swap pager, which appears to
have declined due to code-rot over time. The swap pager rundown code
has been cleaned up, and unneeded wakeups removed. Lots of splbio's
are changed to splvm's. Also, set the dynamic tunables for the
pageout daemon to be more sane for larger systems (thereby decreasing
the daemon overhead.)


33757 23-Feb-1998 dyson

Try to dynamically size the VM_KMEM_SIZE (it can still be overridden
in the same way as before.) I had problems with the system properly
handling the number of vnodes when there is a lot of system memory, and the
default VM_KMEM_SIZE. Two new options "VM_KMEM_SIZE_SCALE" and
"VM_KMEM_SIZE_MAX" have been added to support better auto-sizing for systems
with greater than 128MB.

Add some accounting for vm_zone memory allocations, and provide properly
for vm_zone allocations out of the kmem_map. Also move the vm_zone
allocation stats to the VM OID tree from the KERN OID tree.


33676 20-Feb-1998 bde

Removed unused #includes.


33622 19-Feb-1998 msmith

Move the 'sw' device off block major #1, which is now occupied by 'wfd'.


33181 09-Feb-1998 eivind

Staticize.


33173 08-Feb-1998 dyson

Fix an argument to vn_lock. It appears that a lot of the vn_lock usage
is a bit undisciplined, and should be checked carefully.


33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


33109 05-Feb-1998 dyson

1) Start using a cleaner and more consistent page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


33058 03-Feb-1998 bde

Added #include of <sys/queue.h> so that this file is more "self"-sufficient.


33034 03-Feb-1998 dyson

This fix should help the panic problems in -current. There
were some errors in "interval" management. Due to the
clustering mechanism, the code is necessarily complex and
error prone.


32995 01-Feb-1998 bde

Forward declare more structs that are used in prototypes here - don't
depend on <sys/types.h> forward declaring common ones.


32952 01-Feb-1998 dyson

Fix a performance problem caused by an earlier commit.


32946 31-Jan-1998 dyson

contigalloc doesn't place the allocated page(s) into an object, and
now this breaks vm_page_wire (due to wired page accounting per object.)

This should fix a problem as described by Donald Maddox.


32937 31-Jan-1998 dyson

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


32751 25-Jan-1998 eivind

Turn NSWAPDEV into a new-style option.


32726 24-Jan-1998 eivind

Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduces an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.


32724 24-Jan-1998 dyson

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


32670 21-Jan-1998 dyson

Allow gdb to work again.


32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtle bugs aren't fixed yet.


32454 12-Jan-1998 dyson

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


32305 07-Jan-1998 dyson

Turn off the VTEXT flag when an object is no longer referenced, so
that an executable that is no longer running can be written to. Also,
clear the OBJ_OPT flag more often, when appropriate.


32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.


32132 31-Dec-1997 alex

caddr_t --> void *


32072 29-Dec-1997 dyson

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


32071 29-Dec-1997 dyson

Lots of improvements, including restructuring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


31991 25-Dec-1997 dyson

The ioopt code is still buggy, but wasn't fully disabled.


31970 24-Dec-1997 dyson

Support running with inadequate swap space. Additionally, the code
will complain with a suggestion of increasing it.


31935 22-Dec-1997 dyson

Improve my copyright.


31857 19-Dec-1997 dyson

Change bogus usage of btoc to atop. The incorrect usage of btoc was
pointed out by bde.
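
The difference between the two conversions can be sketched as below. The definitions are modeled on the traditional BSD macros but are written here as hypothetical functions with an assumed 4096-byte page. They are not copied from the tree. atop truncates an address to its page index, while btoc rounds a byte count up to whole pages, so the two disagree for any value that is not page aligned.

```c
#include <assert.h>
#include <limits.h>

#define SK_PAGE_SHIFT	12	/* assumed: 4096-byte pages */
#define SK_PAGE_MASK	((1ul << SK_PAGE_SHIFT) - 1)

/* Address to page index: truncates toward zero. */
static unsigned long
sk_atop(unsigned long x)
{
	return (x >> SK_PAGE_SHIFT);
}

/* Bytes to pages ("clicks"): rounds up to cover a partial page. */
static unsigned long
sk_btoc(unsigned long x)
{
	assert(x <= ULONG_MAX - SK_PAGE_MASK);	/* rounding must not wrap */
	return ((x + SK_PAGE_MASK) >> SK_PAGE_SHIFT);
}
```

For a page-aligned value such as 4096 both return 1. For 4097, sk_atop() returns 1 and sk_btoc() returns 2, which is why substituting one for the other changes results for unaligned offsets.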


31853 19-Dec-1997 dyson

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


31778 16-Dec-1997 eivind

Make COMPAT_43 and COMPAT_SUNOS new-style options.


31729 15-Dec-1997 dyson

Fix a recursive kernel_map lock problem in vm_zone allocator.
PR: 5298


31712 14-Dec-1997 dyson

Slight improvement to the vm_zone stats output. Also, some other superficial
cleanups.


31709 14-Dec-1997 dyson

After one of my analysis passes to evaluate methods for SMP TLB mgmt, I
noticed some major enhancements available for UP situations. The number
of UP TLB flushes is decreased much more than significantly with these
changes. Since a TLB flush appears to cost minimally approx 80 cycles,
this is a "nice" enhancement, equiv to eliminating between 40 and 160
instructions per TLB flush.

Changes include making sure that kernel threads all use the same PTD,
and eliminate unneeded PTD switches at context switch time.


31667 11-Dec-1997 dyson

Fix the prototype for swapout_procs();
Submitted by: dima@best.net


31563 06-Dec-1997 dyson

Support an optional, sysctl enabled feature of idle process swapout. This
is apparently useful for large shell systems, or systems with long running
idle processes. To enable the feature:

sysctl -w vm.swap_idle_enabled=1

Please note that some of the other vm sysctl variables have been renamed
to be more accurate.
Submitted by: Much of it from Matt Dillon <dillon@best.net>


31561 05-Dec-1997 bde

Don't include <sys/lock.h> in headers when only `struct simplelock' is
required. Fixed everything that depended on the pollution.


31550 05-Dec-1997 dyson

Add new (very useful) tunable for pageout daemon. The flag changes
the maximum pageout rate:

sysctl -w vm.vm_maxlaunder=n

1 < n < inf.

If paging heavily on large systems, it is likely that a performance
improvement can be achieved by increasing the parameter. On a large
system, the parm is 32, but numbers as large as 128 can make a big
difference. If paging is expensive, you might try decreasing the
number to 1-8.


31542 04-Dec-1997 dyson

Support applications that need to resist or deny use of swap space.

sysctl -w vm.defer_swap_pageouts=1
Causes the system to resist the use of swap space. In low memory
conditions, performance will decrease.
sysctl -w vm.disable_swap_pageouts=1
Causes the system to mostly disable the use of swap space. In
low memory conditions, the system will likely start killing
processes.


31493 02-Dec-1997 phk

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


31393 24-Nov-1997 bde

Removed all traces of P_IDLEPROC. It was tested but never set.


31392 24-Nov-1997 bde

Don't #define max() to get a version that works with vm_ooffset's.
Just use qmax().

This should be fixed more generally using overloaded functions.


31252 18-Nov-1997 bde

Removed unused #include of <sys/malloc.h>. This file now uses only
zalloc(). Many more cases like this are probably obscured by not
including <vm/zone.h> explicitly (it is spammed into <sys/malloc.h>).


31175 14-Nov-1997 tegge

Simplify map entries during user page wire and user page unwire operations in
vm_map_user_pageable().

Check return value of vm_map_lock_upgrade() during a user page wire operation.


31017 07-Nov-1997 phk

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warnings, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
to be recompiled.


30989 06-Nov-1997 dyson

Fix the "missing page" problem. Also, improve the performance of page
allocation in common cases.


30813 28-Oct-1997 bde

Removed unused #includes.


30701 25-Oct-1997 dyson

Support garbage collecting the pmap pv entries. The management doesn't
happen until the system would have nearly failed anyway, so no significant
overhead is added. This helps large systems with lots of processes.


30700 24-Oct-1997 dyson

Decrease the initial allocation for the zone allocations.


30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of something :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


30309 11-Oct-1997 phk

Distribute and staticize a lot of the malloc M_* types.

Substantial input from: bde


30297 11-Oct-1997 peter

Attempt to fix the previous fix to the contigmalloc1 prototype.
struct malloc_type isn't defined in all cases (eg: from ddb), and the line
wrapping was very badly mangled.


30286 10-Oct-1997 phk

Fix contigmalloc() and contigmalloc1() arguments.


30139 06-Oct-1997 dyson

Improve management of pages moving from the inactive to active queue. Additionally,
add some much needed comments.


30137 06-Oct-1997 dyson

Relax the vnode locking for read only operations.


29657 21-Sep-1997 peter

Fix some style(9) and formatting problems. tabsize 4 formatting doesn't
look too great with 'more' etc.

Approved by: dyson (with a minor grumble :-)


29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the usage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


29368 14-Sep-1997 peter

Update select -> poll in drivers.


29324 13-Sep-1997 peter

Print correct function name in panics


29316 12-Sep-1997 jlemon

Do not consider VM_PROT_OVERRIDE_WRITE to be part of the protection
entry when handling a fault. This is set by procfs whenever it wants
to write to a page, as a means of overriding `r-x COW' entries, but
causes failures in the `rwx' case.

Submitted by: bde


29208 07-Sep-1997 bde

Removed yet more vestiges of config-time swap configuration and/or
cleaned up nearby cruft.


28992 01-Sep-1997 bde

Removed unused #includes.


28991 01-Sep-1997 bde

Some staticized variables were still declared to be extern.


28990 01-Sep-1997 bde

Print a device number in hex instead of decimal.


28954 31-Aug-1997 phk

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measurable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


28940 30-Aug-1997 peter

Allow non-page aligned file offset mmap's, providing that the system is
allowed to choose the address, or that the MAP_FIXED address has the same
remainder when modulo PAGE_SIZE as the file offset. Apparently this is
posix1003.1b specified behavior. SVR4 and the other *BSD's allow it too.
It costs us nothing to support and means we don't get EINVAL on some mmap
code that works perfectly elsewhere.

Obtained from: NetBSD
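
The acceptance rule described in this entry can be sketched in a few lines
of C. This is only an illustration of the stated condition, not the code
from the commit: the helper name, the SKETCH_ macros, and the 4096-byte
page size are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative page size; the real kernel value is machine dependent. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/*
 * Hypothetical helper: returns true when a request with the given address
 * and file offset satisfies the rule from the commit message.  If the
 * system chooses the address (no MAP_FIXED), any offset is acceptable.
 * With MAP_FIXED, the address and the file offset must leave the same
 * remainder modulo the page size.
 */
static bool
sketch_mmap_offset_ok(uintptr_t addr, uint64_t file_off, bool map_fixed)
{
	if (!map_fixed)
		return (true);
	return ((addr & SKETCH_PAGE_MASK) == (file_off & SKETCH_PAGE_MASK));
}
```

For instance, a fixed address of 0x10123 with file offset 0x5123 passes,
because both leave remainder 0x123, while a fixed address of 0x10000 with
the same offset does not.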


28751 25-Aug-1997 bde

Fixed type mismatches for functions with args of type vm_prot_t and/or
vm_inherit_t. These types are smaller than ints, so the prototypes
should have used the promoted type (int) to match the old-style function
definitions. They use just vm_prot_t and/or vm_inherit_t. This depends
on gcc features to work. I fixed the definitions since this is easiest.
The correct fix may be to change the small types to u_int, to optimize
for time instead of space.


28558 22-Aug-1997 dyson

This is a trial improvement for the vnode reference count while on the vnode
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.


28551 21-Aug-1997 bde

#include <machine/limits.h> explicitly in the few places that it is required.


28349 18-Aug-1997 fsmp

Added includes of smp.h for SMP.
This eliminates a bazillion warnings about implicit s_lock & friends.


28345 18-Aug-1997 dyson

Fix kern_lock so that it will work. Additionally, clean-up some of the
VM systems usage of the kernel lock (lockmgr) code. This is a first
pass implementation, and is expected to evolve as needed. The API
for the lock manager code has not changed, but the underlying implementation
has changed significantly. This change should not materially affect
our current SMP or UP code without non-standard parameters being used.


28028 10-Aug-1997 dyson

The "cutsie" register parameter passing that I had mistakenly used breaks
profiling. Since it doesn't really improve perf much, I have backed it
out.


27947 07-Aug-1997 dyson

More vm_zone cleanup. The sysctl now accounts for items better, and
counts the number of allocations.


27930 06-Aug-1997 dyson

Add exposure of some vm_zone allocation stats by sysctl. Also, change
the initialization parameters of some zones in VM map. This contains
only optimizations and not bugfixes.


27924 05-Aug-1997 dyson

Fixed the commit botch that was causing crashes soon after system
startup. Due to the error, the initialization of the zone for
pv_entries was missing. The system should be usable again.


27923 05-Aug-1997 dyson

Another attempt at cleaning up the new memory allocator.


27922 05-Aug-1997 dyson

Fix some bugs, document vm_zone better. Add copyright to vm_zone.h. Use
the new zone code in pmap.c so that we can get rid of the ugly ad-hoc
allocations in pmap.c.


27905 05-Aug-1997 dyson

Modify pmap to use our new memory allocator. Also, change the vm_map_entry
allocations to be interrupt safe.


27901 05-Aug-1997 dyson

A very simple zone allocator.


27899 05-Aug-1997 dyson

Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of
a simple, clean zone type allocator. This new allocator will also be
used for machine dependent pmap PV entries.


27845 02-Aug-1997 bde

Removed unused #includes.


27716 27-Jul-1997 dyson

Add the ability for the pageout daemon to measure stats on memory usage before
the system is out of memory. The daemon does a minimal amount of work that
increases as the system becomes more likely to run out of memory and page in/out.

The default tuning is fairly low in background CPU usage, and sysctl variables
have been added to enable flexible operation. This is an experimental feature
that will likely be changed and improved over time.


27715 27-Jul-1997 dyson

Fix a very subtle problem that causes unnecessary numbers of objects backing
a single logical object.
Submitted by: Alan Cox <alc@cs.rice.edu>


27464 17-Jul-1997 dyson

Add support for 4MB pages. This includes the .text, .data, .data parts
of the kernel, and also most of the dynamic parts of the kernel. Additionally,
4MB pages will be allocated for display buffers as appropriate (only.)

The 4MB support for SMP isn't complete, but doesn't interfere with operation
either.


26851 23-Jun-1997 tegge

Don't try upgrading an existing exclusive lock in vm_map_user_pageable.
This should close PR kern/3180.
Also remove a bogus unconditional call to vm_map_unlock_read in
vm_map_lookup.


26811 22-Jun-1997 peter

Kill some stale leftovers from the earlier attempts at SMP per-cpu pages


26780 22-Jun-1997 dyson

Remove a window during running down a file vnode. Also, the OBJ_DEAD
flag wasn't being respected during vref(), et al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.


26668 15-Jun-1997 dyson

Correct the return code for the mlock system call. Also add the stubs
for mlockall and munlockall.


26667 15-Jun-1997 dyson

Fix a reference problem with maps. Only appears to manifest itself when
sharing address spaces.


26258 29-May-1997 peter

Update the #include "opt_smpxxx.h" includes - opt_smp.h isn't needed
very much in the generic parts of the kernel now.


25930 19-May-1997 dfr

Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson


25352 01-May-1997 dyson

Check the correct queue for waking up the pageout daemon. Specifically,
the pageout daemon wasn't always being woken up appropriately when the
(cache + free) queues were depleted.
Submitted by: David S. Miller <davem@jenolan.rutgers.edu>


25164 26-Apr-1997 peter

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


25074 21-Apr-1997 peter

Send this to the Attic so there's no mixups over which kern_lock.c is in
use in -current.


24917 14-Apr-1997 peter

Unused variable (upobj is now purely handled within pmap)


24848 13-Apr-1997 dyson

Fully implement vfork. Vfork is now much much faster than even our
fork. (On my machine, fork is about 240usecs, vfork is 78usecs.)

Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory
from the other threads of a group.

Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating
possible existing shares with other threads/processes.

Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a
thread from the rest of the group.

Fix the case where a thread does an exec. It is almost nonsense for a thread
to modify the other threads' address space by an exec, so we
now automatically divorce the address space before modifying it.


24691 07-Apr-1997 peter

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler than switching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplify the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measurably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


24678 06-Apr-1997 peter

Commit a typo fix that's been sitting in my tree for ages, quite forgotten.
The typo was detected once upon a time with the -Wunused compile option.
The result was that a block of code for implementing
madvise(.. MADV_SEQUENTIAL..) behavior was "dead" and unused, probably
negating the effect of activating the option.

Reviewed by: dyson


24668 06-Apr-1997 dyson

Make vm_map_protect be more complete about map simplification. This
is useful when a process changes its page range protections very
much.
Submitted by: Alan Cox <alc@cs.rice.edu>


24667 06-Apr-1997 dyson

Correction to the prototype for vm_fault.


24666 06-Apr-1997 dyson

Fix the gdb executable modify problem. Thanks to the detective work
by Alan Cox <alc@cs.rice.edu>, and his description of the problem.

The bug was primarily in procfs_mem, but the mistake likely happened
due to the lack of vm system support for the operation. I added
better support for selective marking of page dirty flags so that
vm_map_pageable(wiring) will not cause this problem again.

The code in procfs_mem is now less bogus (but maybe still a little
so.)


24478 01-Apr-1997 bde

Removed potentially harmful garbage <vm/lock.h> and fixed bogus
use of it. It was actually harmless because the use was null due
to fortuitous include orders and identical (wrong) idempotency
macros.


24437 31-Mar-1997 dg

Changed the way that the exec image header is read to be filesystem-
centric rather than VM-centric to fix a problem with errors not being
detectable when the header is read.
Killed exech_map as a result of these changes.
There appears to be no performance difference with this change.


24131 23-Mar-1997 bde

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


24130 23-Mar-1997 dyson

Fix a significant error in the accounting for pre-zeroed pages. This
is a candidate for RELENG_2_2...


23502 08-Mar-1997 dyson

When removing IN_RECURSE support during the Lite/2 merge, read/write
to/from mmaped regions was broken. This commit fixes the breakage, and
uses the new Lite/2 locking mechanisms.


23157 27-Feb-1997 bde

Removed a wrong LK_INTERLOCK flag.


22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


22878 18-Feb-1997 bde

Removed vestiges of Mach lock types.

vm_map.h:
Removed #include of <sys/proc.h>. curproc is only used in some macros
and users of the macros already include <sys/proc.h>.


22670 13-Feb-1997 wollman

Provide an alternative interface to contigmalloc() which allows a specific
map to be used when allocating the kernel va (e.g., mb_map). The VM
gurus may want to look this over.


22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


22156 31-Jan-1997 dyson

Another fix to inheriting shared segments. Do the copy on write
thing if needed.
Submitted by: Alan Cox <alc@cs.rice.edu>


21987 24-Jan-1997 dg

Added a check/panic for v_usecount being 0 (no vnode reference) in
vnode_pager_alloc().


21940 22-Jan-1997 dyson

Fix two problems where a NULL object is dereferenced. One problem
was in the VM_INHERIT_SHARE case of vmspace_fork, and also in vm_map_madvise.
Submitted by: Alan Cox <alc@cs.rice.edu>


21881 20-Jan-1997 dyson

Make MADV_FREE work better. Specifically, it did not wait for
the page to be unbusy, and it caused some algorithmic problems
as a result. There were some other problems with it also, so
this is a general cleanup of the code.
Submitted by: Douglas Crosher <dtc@scrooge.ee.swin.oz.au> and myself.


21754 16-Jan-1997 dyson

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.


21737 15-Jan-1997 dg

Fix bug related to map entry allocations where a sleep might be attempted
when allocating memory for network buffers at interrupt time. This is due
to inadequate checking for the new mcl_map. Fixed by merging mb_map and
mcl_map into a single mb_map.

Reviewed by: wollman


21733 15-Jan-1997 bde

Removed redundant spl0()'s from kernel processes. They were work-arounds
for a bug in fork().


21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


21530 11-Jan-1997 dyson

Slightly correct the code that moves pages from the active to the
inactive queue. This is only a minor performance improvement, but will
not affect perf on machines that don't have ref bits.


21529 11-Jan-1997 dyson

Prepare better for multi-platform by eliminating another required
pmap routine (pmap_is_referenced.) Upper level recoded to use
pmap_ts_referenced.


21258 03-Jan-1997 dyson

Undo the collapse breakage (swap space usage problem.)


21157 01-Jan-1997 dyson

Guess what? We left a lot of the old collapse code that is not needed
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.


21134 31-Dec-1996 dyson

A very significant improvement in the management of process maps
and objects. Previously, "fancy" memory management techniques
such as that used by the M3 RTS would have the tendency of chopping
up a process's allocated memory into lots of little objects. Alan
has come up with some improvements to mitigate the situation to
the point where even the M3 RTS only has one object for bss and
its managed memory (when running CVSUP.) (There are still cases where the
situation isn't improved when the system pages -- but this is much much
better for the vast majority of cases.) The system will now be able
to much more effectively merge map entries.

Submitted by: Alan Cox <alc@cs.rice.edu>


21039 30-Dec-1996 dyson

Let the VM system know that on certain arch's that VM_PROT_READ
also implies VM_PROT_EXEC. We support it that way for now,
since the break system call by default gives VM_PROT_ALL. Now
we have a better chance of coalescing map entries when mixing
mmap/break type operations. This was contributing to excessive
numbers of map entries on the modula-3 runtime system. The
problem is still not "solved", but the situation makes more
sense.

Eventually, when we work on architectures where VM_PROT_READ
is orthogonal to VM_PROT_EXEC, we will have to visit this
issue carefully (esp. regarding security issues.)


21037 30-Dec-1996 dyson

EEEK!!! useracc and kernacc didn't lock their respective
maps. Additionally, eliminate the map->hint distortion
associated with useracc. That may/may-not be the "right"
thing to do -- but time will tell.
Submitted by: Partially by Alan Cox <alc@cs.rice.edu>


20999 29-Dec-1996 dyson

Superficial cleanup of comment.


20993 28-Dec-1996 dyson

Eliminate the redundancy due to the similarity between the routines
vm_map_simplify and vm_map_simplify_entry. Make vm_map_simplify_entry
handle wired maps so that we can get rid of vm_map_simplify. Modify
the callers of vm_map_simplify to properly use vm_map_simplify_entry.
Submitted by: Alan Cox <alc@cs.rice.edu>


20991 28-Dec-1996 dyson

The code unnecessarily created an object with no handle up-front, which
has the negative effect of disabling some map optimizations. This
patch defers the creation of the object until it needs to be at fault time.
Submitted by: Alan Cox <alc@cs.rice.edu>


20821 22-Dec-1996 joerg

Make DFLDSIZ and MAXDSIZ fully-supported options.

"Don't forget to do a ``make depend''" :-)


20449 14-Dec-1996 dyson

Implement closer-to POSIX mlock semantics. The major difference is
that we do allow mlock to span unallocated regions (of course, not
mlocking them.) We also allow mlocking of RO regions (which the old
code couldn't.) The restriction there is that once a RO region is
wired (mlocked), it cannot be debugged (or EVER written to.)

Under normal usage, the new mlock code will be a significant improvement
over our old stuff.


20189 07-Dec-1996 dyson

Expunge inlines...


20187 07-Dec-1996 dyson

Fix a map entry leak problem found by DG. Also, de-inline a function
vm_map_entry_dispose, because it won't help being inlined.


20182 07-Dec-1996 dyson

Make vm_map_insert much more intelligent in the MAP_NOFAULT case so
that map entries are coalesced when appropriate. Also, conditionalize
some code that is currently not used in vm_map_insert. This mod
has been added to eliminate unnecessary map entries in buffer map.

Additionally, there were some cases where map coalescing could be done
when it shouldn't. That problem has been resolved.


20054 30-Nov-1996 dyson

Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation
scheme. Additionally, add the capability for checking for unexpected
kernel page faults. The maximum amount of kva space for buffers hasn't
been decreased from where it is, but it will now be possible to do so.

This scheme manages the kva space similar to the buffers themselves. If
there isn't enough kva space because of usage or fragmentation, buffers
will be reclaimed until a buffer allocation is successful. This scheme
should be very resistant to fragmentation problems until/if the LFS code
is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS
is not likely to use such a scheme.

Now there should be NO problem allocating buffers up to MAXPHYS.


20007 28-Nov-1996 dyson

Make the kernel smaller with at worst a neutral effect on perf by
de-inlining some VM calls. (Actually, I measured a small improvement.)


19830 17-Nov-1996 dyson

Improve the locality of reference for variables in vm_page and
vm_kern by moving them from .bss to .data. With this change,
there is a measurable perf improvement in fork/exec.


19415 05-Nov-1996 dyson

Vastly improved contigmalloc routine. It does not solve the
problem of allocating contiguous buffer memory in general, but
makes it much more likely to work at boot-up time. The best
chance for an LKM-type load of a sound driver is immediately
after the mount of the root filesystem.

This appears to work for a 64K allocation on an 8MB system.


19259 29-Oct-1996 dyson

Change mmap to use OBJT_DEFAULT instead of OBJT_SWAP by default
for anonymous objects. The system will automatically change the
type to SWAP if needed (for size or pageout reasons.)


19216 27-Oct-1996 phk

The way we get a vnode for swapdev is not quite kosher. In particular
it breaks in the DEVFS_ROOT case. replicate a bit too much of bdevvp()
in here to circumvent the problem. The real problem is the magic that
lives in bdevsw[1].


19142 24-Oct-1996 dyson

Remove a bogus optimization in the mmap code. It is superfluous,
and at best is the same speed as the unoptimized code. At worst, it
slows down trivial programs.


18974 17-Oct-1996 dyson

Make woken-up processes eligible for immediate swap-in.


18973 17-Oct-1996 dyson

Clean up the rundown of the object backing a vnode. This should fix
NFS problems associated with forcible dismounts.


18942 15-Oct-1996 bde

Removed nested include of <sys/proc.h> from <vm/vm_object.h> and fixed
the one place that depended on it. wakeup() is now prototyped in
<sys/systm.h> so that it is normally visible.

Added nested include of <sys/queue.h> in <vm/vm_object.h>. The queue
macros are a more fundamental prerequisite for <vm/vm_object.h> than
the wakeup prototype and previously happened to be included by
namespace pollution from <sys/proc.h> or elsewhere.


18937 15-Oct-1996 dyson

Move much of the machine dependent code from vm_glue.c into
pmap.c. Along with the improved organization, small proc fork
performance is now about 5%-10% faster.


18908 13-Oct-1996 phk

Remove a stale comment.


18893 12-Oct-1996 bde

Removed __pure's and __pure2's. __pure is a no-op for recent versions
of gcc by definition, and __pure2 is a no-op in effect (presumably the
compiler can see when an inline function has no side effects).


18779 06-Oct-1996 dyson

Make the default cache size optim to be 256K, the old default was
64K. The change has essentially neutral effect on those machines with
little or no cache, and has a positive effect on "normal" machines
with 256K or more cache.


18768 06-Oct-1996 dyson

Fix a problem with the page coloring code that the system will not always
be able to use all of the free pages. This can manifest as a panic
using DIAGNOSTIC, or as a panic on an indirect memory reference.


18542 28-Sep-1996 bde

Fixed undeclared variables for the !(PQ_L2_SIZE > 1) case.

Removed redundant #include.


18526 28-Sep-1996 dyson

Reviewed by:
Submitted by:
Obtained from:


18389 19-Sep-1996 dg

Fixed bug with reversed trunc/round_page() in madvise...start must be
trunced, end must be rounded.
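
Why the direction matters can be seen in a small sketch. The helpers below
are stand-ins written for this example (the sketch_ names and the 4096-byte
page size are assumptions), not the kernel's own trunc_page()/round_page()
macros or the madvise code itself.

```c
#include <stdint.h>

/* Illustrative page size; the real kernel value is machine dependent. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK ((uintptr_t)SKETCH_PAGE_SIZE - 1)

/* Round an address down to the start of its page. */
static uintptr_t
sketch_trunc_page(uintptr_t x)
{
	return (x & ~SKETCH_PAGE_MASK);
}

/* Round an address up to the next page boundary. */
static uintptr_t
sketch_round_page(uintptr_t x)
{
	return ((x + SKETCH_PAGE_MASK) & ~SKETCH_PAGE_MASK);
}

struct sketch_range {
	uintptr_t start;
	uintptr_t end;
};

/*
 * Hypothetical helper: compute the page-aligned range that fully covers
 * [addr, addr + len).  The start is truncated and the end is rounded, as
 * the commit message requires.
 */
static struct sketch_range
sketch_page_cover(uintptr_t addr, uintptr_t len)
{
	struct sketch_range r;

	r.start = sketch_trunc_page(addr);
	r.end = sketch_round_page(addr + len);
	return (r);
}
```

With addr 0x1234 and len 0x100 this yields [0x1000, 0x2000). Swapping the
two helpers would instead give start 0x2000 and end 0x1000, an inverted
range that covers none of the requested bytes.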


18307 15-Sep-1996 bde

Removed iprintf(). It was copied to db_iprintf() in ddb.


18298 14-Sep-1996 bde

Attached vm ddb commands `show map', `show vmochk', `show object',
`show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff
to the ends of the vm source files.

Changed printf() to db_printf(), `indent' to db_indent, and iprintf()
to db_iprintf() in ddb commands. Moved db_indent and db_iprintf()
from vm to ddb.

vm_page.c:
Don't use __pure. Staticized.

db_output.c:
Reduced page width from 80 to 79 to inhibit double spacing for long
lines (there are still some problems if words are printed across
column 79).


18205 10-Sep-1996 dyson

The whole issue of not supporting VOP_LOCK for VBLK devices should be
rethought. This fixes YET another problem with unmounting filesystems.
The root cause is not fixed here, but at least the problem has gone
away.


18178 08-Sep-1996 dyson

Fixed the use of the wrong variable in vm_map_madvise.


18169 08-Sep-1996 dyson

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


18163 08-Sep-1996 dyson

Improve the scalability of certain pmap operations.


17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistent
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


17334 30-Jul-1996 dyson

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


17313 28-Jul-1996 dg

Slight performance tweak for previous commit.


17312 28-Jul-1996 dyson

Undo part of the scalability commit. Many of the changes
in vm_fault had some performance enhancements not ready
for prime time. This commit backs out some of the changes.


17301 27-Jul-1996 dyson

Allow sequentially created mmap'ed anonymous regions to coalesce. There
is little or no reason to create a swap pager for small mmap's. The
vm_map_insert code will automatically create a swap pager if the object
becomes too large. This fix, per a request from phk.


17298 27-Jul-1996 dyson

Clean up some lint.


17297 27-Jul-1996 dyson

Remove experimental header file. My test-build must have picked it
up in an unexpected place.
Submitted by: jkh


17295 27-Jul-1996 dyson

Missing (prototype) change from the previous commit.


17294 27-Jul-1996 dyson

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.
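
Item 6 above can be illustrated with a toy model. The structure and
function below are hypothetical and far simpler than the real pv list and
pmap code; they only show how a combined test-and-clear needs a single
list traversal where a separate test followed by a clear would need two.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pv entry: points at one page table entry word. */
struct sketch_pv {
	unsigned *pte;
	struct sketch_pv *next;
};

/*
 * Hypothetical combined operation: walk the list once, report whether
 * the bit was set in any mapping, and clear it in the same pass.
 */
static bool
sketch_tc_bit(struct sketch_pv *head, unsigned bit)
{
	bool was_set = false;

	for (struct sketch_pv *pv = head; pv != NULL; pv = pv->next) {
		if (*pv->pte & bit) {
			was_set = true;
			*pv->pte &= ~bit;
		}
	}
	return (was_set);
}
```

A caller that previously tested and then cleared a bit would make one call
to such a routine and use the return value as the result of the test.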


17108 12-Jul-1996 bde

Don't use NULL in non-pointer contexts.


17004 08-Jul-1996 dyson

Back-off on the previous commit, specifically remove the look-ahead
optimization on the active queue scan. I will do this correctly later.


17003 08-Jul-1996 dyson

Fix a problem with the pageout daemon RSS limiting, where it degrades
performance to LRU or worse when RSS limiting takes effect. Also,
make an end condition in the active queue scan more efficient in the
case where pages are removed from the active queue as a side effect
of a pmap operation.


16993 07-Jul-1996 dg

In all special cases for spl or page_alloc where kmem_map is checked for,
mb_map (a submap of kmem_map) must also be checked.
Thanks to wcarchive (err...sort of) for demonstrating this bug.


16892 02-Jul-1996 dyson

Properly set the PG_MAPPED and PG_WRITEABLE flags. This fixes some potential
problems with vm_map_remove/vm_map_delete.


16858 30-Jun-1996 dyson

Make -current consistent with -stable regarding time that a process
sleeps before being swapped out. The time is increased from 4 secs to
10 secs. Originally I had decreased it from 20 to 4, but that is a bit
severe. 20 is too long though.


16834 29-Jun-1996 dg

Make sure we have an object in the map entry before trying to trim pages
from it.


16750 26-Jun-1996 dyson

This commit does a couple of things:
Re-enables the RSS limiting, and the routine is now tail-recursive,
making it much safer (eliminates the possibility of kernel stack
overflow.) Also, the RSS limiting is a little more intelligent about
finding the likely objects that are pushing the process over the limit.

Added some sysctls that help with VM system tuning.

New sysctl features:
1) Enable/disable lru pageout algorithm.
vm.pageout_algorithm = 0, default algorithm that works
well, especially using X windows and heavy
memory loading. Can have adverse effects,
sometimes slowing down program loading.

vm.pageout_algorithm = 1, close to true LRU. Works much
better than clock, etc. Does not work as well as
the default algorithm in general. Certain memory
"malloc" type benchmarks work a little better with
this setting.

Please give me feedback on the performance results
associated with these.

2) Enable/disable swapping.
vm.swapping_enabled = 1, default.

vm.swapping_enabled = 0, useful for cases where swapping
degrades performance.

The config option "NO_SWAPPING" is still operative, and
takes precedence over the sysctl. If "NO_SWAPPING" is
specified, the sysctl still exists, but "vm.swapping_enabled"
is hard-wired to "0".

Each of these can be changed "on the fly."


16679 25-Jun-1996 dyson

Fix some serious problems with limits checking in the sbrk(2)/brk(2)
code.
Reviewed by: bde


16664 24-Jun-1996 dyson

Remove RSS limiting until I rewrite the code to be non-recursive. The
code can overrun the kernel stack under very stressful conditions.


16562 21-Jun-1996 dyson

Improve algorithm for page hash queue. It was previously about
as bad as it could be. This algorithm appears to improve fork
performance (barely) measurably.


16415 17-Jun-1996 dyson

Several bugfixes/improvements:
1) Make it much less likely to miss a wakeup in vm_page_free_wakeup
2) Create a new entry point into pmap: pmap_ts_referenced, which eliminates
the need to scan the pv lists twice in many cases. Perhaps there
is a lot more to do here to minimize pv list manipulation.
3) Minor improvements to vm_pageout including the use of pmap_ts_ref.
4) Major changes and code improvement to pmap. This code has had
several serious bugs in page table page manipulation. In order
to simplify the problem, and hopefully solve it once and for all,
page table pages are no longer "managed" with the pv list stuff.
Page table pages are only (mapped and held/wired) or
(free and unused) now. Page table pages are never inactive,
active or cached. These changes have probably fixed the
hold count problems, but if they haven't, then the code is
simpler anyway for future bugfixing.
5) The pmap code has been sorely in need of re-organization, and I
have taken a first (of probably many) steps. Please tell me
if you have any ideas.


16409 16-Jun-1996 dyson

Various bugfixes/cleanups from me and others:
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.


16377 14-Jun-1996 dg

Move a case of PG_MAPPED being set before a pmap_enter(). This will likely
make no difference, but it will make it consistent with other uses of
PG_MAPPED.


16324 12-Jun-1996 dyson

Fix a very significant cnt.v_wire_count leak in vm_page.c, and some
minor leaks in pmap.c. Bruce Evans made me aware of this problem.


16318 12-Jun-1996 dyson

Fix some serious errors in vm_map_simplify_entries.


16274 10-Jun-1996 dyson

Mostly superficial code improvements, add a diagnostic. The
code improvements include significant simplification of the reservation
of the swap pager control blocks for reads. Add a panic for an inconsistent
swap pager control block count.


16268 10-Jun-1996 dyson

Keep the vm_fault/vm_pageout from getting into an "infinite paging loop", by
reserving "cached" pages before waking up the pageout daemon. This will reserve
the faulted page, and keep the system from thrashing itself to death given
this condition.


16197 08-Jun-1996 dyson

Adjust the threshold for blocking on movement of pages from the cache
queue in vm_fault.

Move the PG_BUSY in vm_fault to the correct place.

Remove redundant/unnecessary code in pmap.c.

Properly block on rundown of page table pages, if they are busy.

I think that the VM system is in pretty good shape now, and the following
individuals (among others, in no particular order) have helped with this
recent bunch of bugs, thanks! If I left anyone out, I apologize!

Stephen McKay, Stephen Hocking, Eric J. Chet, Dan O'Brien, James Raynard,
Marc Fournier.


16122 05-Jun-1996 dyson

Keep page-table pages from ever being sensed as dirty. This should fix
some problems with the page-table page management code, since it can't
deal with the notion of page-table pages being paged out or in transit.
Also, clean up some stylistic issues per some suggestions from
Stephen McKay.


16058 01-Jun-1996 dyson

Disable madvise optimizations for device pager objects (some of the
operations don't work with FICTITIOUS pages.) Also, close a window
between PG_MANAGED and pmap_enter that can mess up the accounting of
the managed flag. This problem could likely cause a hold_count error
for page table pages.


16026 31-May-1996 dyson

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


15980 29-May-1996 dyson

Correct some unfortunately chosen constants, otherwise, not enough
pages are calculated for deferred allocation of swap pager data structures.
This is a follow-on to the previous commit to this file.


15979 29-May-1996 dyson

After careful review by David Greenman and myself, David had found a
case where blocking can occur, thereby giving other processes a chance
to modify the queue where a page resides. This could cause numerous
process and system failures.


15978 29-May-1996 dyson

Make sure that pageout deadlocks cannot occur. There is a problem
that the datastructures needed to support the swap pager can take
enough space to fully deplete system memory, and cause a deadlock.
This change keeps large objects from being filled with dirty pages
without the appropriate swap pager datastructures. Right now,
default objects greater than 1/4 the size of available system memory
are converted to swap objects, thereby eliminating the risk of deadlock.


15905 26-May-1996 dyson

Fix a couple of problems in the pageout_scan routine. First, there is
a condition when blocking can occur, and the daemon did not check properly
for a page remaining on the expected queue. Additionally, the inactive
target was being set much too large for small memory machines. It is now
being calculated based upon the amount of user memory available on every
pageout daemon run. Another problem was that if memory was very low, the
pageout daemon could fail repeatedly to traverse the inactive queue.


15904 26-May-1996 dyson

I think this covers (fixes) the last batch of freeing active/held/busy page
problem. BY MISTAKE, the vm_page_unqueue (or equiv) was removed from the
vm_fault code. Really bad things appear to happen if a page is on a queue
while it is being faulted.


15890 24-May-1996 dyson

Add an assert to vm_page_cache. We should never cache a dirty page.


15889 24-May-1996 dyson

Add apparently needed splvm protection to the active queue, and eliminate
an unnecessary test for dirty pages if it is already known to be dirty.


15888 24-May-1996 dyson

Eliminate inefficient check for dirty pages for pages in the PQ_CACHE
queue. Also, modify the MADV_FREE policy (it probably still isn't the final
version.)


15887 24-May-1996 dyson

Make the conversion from the default pager to swap pager more robust
in the face of low memory conditions.


15876 23-May-1996 dyson

Eliminate a vm_page_free, busy panic, in kern_malloc.


15873 23-May-1996 dyson

Initial support for MADV_FREE, for pages whose contents we no longer
care about. This gives us a lot of the advantage of
freeing individual pages through munmap, but with almost none of the
overhead.


15841 21-May-1996 dyson

After reviewing the previous commit to vm_object, the page protection
is never necessary, not just for PG_FICTITIOUS.


15836 21-May-1996 dyson

Don't protect non-managed pages off during object rundown. This fixes
a hang that occurs under certain circumstances when exiting X.


15819 19-May-1996 dyson

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.


15811 18-May-1996 dyson

One more file missing from the mega-commit. This inlines some very
simple routines in vm_page.c, so that an unnecessary subroutine call
is removed.


15810 18-May-1996 dyson

File mistakenly left out of the previous mega-commit. This provides
a global defn for 'exech_map.'


15809 18-May-1996 dyson

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existing condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


15722 10-May-1996 wollman

Allocate mbufs from a separate submap so that NMBCLUSTERS works as
expected.


15583 03-May-1996 phk

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


15543 02-May-1996 phk

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


15534 02-May-1996 phk

KGDB is dead. It may come back one day if somebody does it.


15459 29-Apr-1996 dyson

Move the map entry allocations from the kmem_map to the kernel_map. As
a side effect, correct the associated object offset.


15367 24-Apr-1996 dyson

This fixes kmem_malloc/kmem_free (and malloc/free of objects of > 8K).
A page index was calculated incorrectly in vm_kern, and vm_object_page_remove
removed pages that should not have been.


15203 11-Apr-1996 bde

Fixed a spl hog. The vmdaemon process ran entirely at splhigh. It
sometimes disabled clock interrupts for 60 msec or more on a P133.
Clock interrupts were lost ...

Reviewed by: dyson


15153 09-Apr-1996 dyson

Reinstitute the map lock for processes being swapped out. This
is needed because of the vm_fault used to bring the page table page
for the kernel stack (UPAGES) back in. The consequence of the
previous incorrect change was a system hang.


15134 08-Apr-1996 dyson

Map lock checks not needed anymore for swapping out. We don't use
map operations for it anymore. Certain deadlocks should never happen
anymore.


15117 07-Apr-1996 bde

Removed never-used #includes of <machine/cpu.h>. Many were apparently
copied from bad examples.


15018 03-Apr-1996 dyson

Fixed a problem that the UPAGES of a process were being run down
in a suboptimal manner. I had also noticed some panics that appeared
to be at least superficially caused by this problem. Also, included
are some minor mods to support more general handling of page table page
faulting. More details in a future commit.


14900 29-Mar-1996 dg

Revert to previous calculation of vm_object_cache_max: it simply works
better in most real-world cases.


14882 28-Mar-1996 bde

Undid last revision. It duplicated part of second last revision.


14879 28-Mar-1996 scrappy

devfs_add_devsw() -> devfs_add_devswf modifications

Reviewed by: julian@freebsd.org


14866 28-Mar-1996 dyson

Add a function prototype for pmap_prefault.


14865 28-Mar-1996 dyson

VM performance improvements, and reorder some operations in VM fault
in anticipation of a fix in pmap that will allow the mlock system call to work
without panicking the system.


14864 28-Mar-1996 dyson

More map_simplify fixes from Alan Cox. This very significantly improves the
performance when the map has been chopped up. The map simplify operations
really work now.
Reviewed by: dyson
Submitted by: Alan Cox <alc@cs.rice.edu>


14854 27-Mar-1996 bde

Added drum device.

Submitted by: partly by "Marc G. Fournier" <scrappy@ki.net>


14693 19-Mar-1996 dyson

Fix reference count problems when unmounting filesystems that are backed
by a VMIO device. We mark the underlying object
non-persistent, and account for the reference count that the VM system
maintains for the special device close. This should fix the removable
device problem.


14638 16-Mar-1996 dg

Force device mappings to always be shared. It doesn't make sense for them
to ever be COW and we need the mappings to be shared for backward
compatibility.

Reviewed by: dyson


14610 13-Mar-1996 dyson

This commit is as a result of a comment by Alan Cox (alc@cs.rice.edu)
regarding the "real" problem with maps that we have been having
over the last few weeks. He noted that the first_free pointer was
left dangling in certain circumstances -- and he was right!!! This
should fix the map problems that we were having, and also give us the
advantage of being able to simplify maps more aggressively.


14589 12-Mar-1996 dyson

Fix the map corruption problem that appears as a u_map allocation
error.


14574 12-Mar-1996 dyson

Allow mmap'ed devices to work correctly across forks. The sanest
solution appeared to be to allow the child to maintain the same mapping as
the parent.


14531 11-Mar-1996 hsu

For Lite2: proc LIST changes.
Reviewed by: davidg & bde


14432 09-Mar-1996 dyson

Delay forking a process until there are more pages available. It was
possible to deadlock with the low threshold that we had used.


14431 09-Mar-1996 dyson

Modify a threshold for waking up the pageout daemon. Also, add a consistency
check for making sure that held pages aren't freed (DG).


14430 09-Mar-1996 dyson

Add a missing initialization of the hold_count for device pager fictitious
pages.


14429 09-Mar-1996 dyson

Fix a calculation for a paging parameter.


14428 09-Mar-1996 dyson

Fix two problems:
The pmap_remove in vm_map_clean incorrectly unmapped the entire
map entry.
The new vm_map_simplify_entry code had an error (the offset
of the combined map entry was not set correctly.)
Submitted by: Alan Cox <alc@cs.rice.edu>


14427 09-Mar-1996 dyson

Set the page valid bits in fewer places, as opposed to being scattered
in various places.


14396 06-Mar-1996 dyson

Fix a problem in the swap pager that caused some of the pages that
were paged in under low swap space conditions to both lose their
backing store and their dirty bits. This would cause pages to
be demand zeroed under certain conditions in low VM space conditions
and consequential sig-11's or sig-10's. This situation was made
worse lately when the level for swap space reclaim threshold was
increased.


14366 04-Mar-1996 dyson

Fix a problem that pages in a mapped region were not always
properly invalidated. Now we traverse the object shadow chain
properly.


14364 03-Mar-1996 dyson

In order to fix some concurrency problems with the swap pager early
on in the FreeBSD development, I had made a global lock around the
rlist code. This was bogus, and now the lock is maintained on a
per resource list basis. This now allows the rlist code to be used for
almost any non-interrupt level application.


14360 03-Mar-1996 peter

Remove the #ifdef notyet from the prototype of vm_map_simplify. John
re-enabled the function but missed the prototype, causing a warning.


14325 02-Mar-1996 peter

Oops.. I nearly forgot the actual core of the length/rounding/etc fixes
that Bruce asked for.

These still are not quite perfect, and in particular, it can get
upset on extreme boundary cases (addr = 0xfff, len = 0xffffffff,
which would end up mapping a single page rather than failing), but
this is better code than I committed before.

(note, the VM system does not (apparently) support single mmap segment
sizes above 0x80000000 anyway)
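
The boundary case mentioned above can be worked through with 32-bit
arithmetic. This is a sketch only: the helper names and the 4K page size
are assumptions for illustration, and the rounding shown is one plausible
way a mapping length could be derived, not the code from this commit.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096u              /* assumed page size */
#define EX_PAGE_MASK (EX_PAGE_SIZE - 1)

/* Round an address down / up to a page boundary in a 32-bit space. */
uint32_t ex_trunc_page(uint32_t x) { return x & ~EX_PAGE_MASK; }
uint32_t ex_round_page(uint32_t x) { return (x + EX_PAGE_MASK) & ~EX_PAGE_MASK; }

/* Bytes that would be mapped for (addr, len) after page rounding. */
uint32_t
ex_mapped_bytes(uint32_t addr, uint32_t len)
{
	uint32_t start = ex_trunc_page(addr);
	uint32_t end = ex_round_page(addr + len);  /* addr + len wraps mod 2^32 */

	return end - start;
}

/* A check that would reject such a request instead of truncating it. */
bool
ex_range_wraps(uint32_t addr, uint32_t len)
{
	return (uint32_t)(addr + len) < addr;
}
```

For addr = 0xfff and len = 0xffffffff the sum wraps to 0xffe, which rounds
up to 0x1000, so the request collapses to a single 4096-byte page. A
wraparound test such as ex_range_wraps() is one way to fail the request
instead.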


14316 02-Mar-1996 dyson

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


14221 23-Feb-1996 peter

kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in its own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD


14178 22-Feb-1996 dg

Add a "NO_SWAPPING" option to disable swapping. This was originally done
to help diagnose a problem on wcarchive (where the kernel stack was
sometimes not present), but is useful in its own right since swapping
actually reduces performance on some systems (such as wcarchive).
Note: swapping in this context means making the U pages pageable and has
nothing to do with generic VM paging, which is unaffected by this option.

Reviewed by: <dyson>


14036 11-Feb-1996 dyson

Fixed a really bogus problem with msync ripping pages away from
objects before they were written. Also, don't allow processes
without write access to remove pages from vm_objects.


13909 04-Feb-1996 dyson

Changed vm_fault_quick in vm_machdep.c to be global. Needed for
new pipe code.


13790 31-Jan-1996 dg

"out of space" -> "out of swap space".


13788 31-Jan-1996 dg

Improved killproc() log message and made it and the other similar message
tolerant of p_ucred being invalid. Starting using killproc() where
appropriate.


13786 31-Jan-1996 dg

Print a more descriptive message when the mb_map is filled (out of mbuf
clusters), and tell the operator what to do about it (increase maxusers).


13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


13705 29-Jan-1996 dg

Added a check/panic for vm_map_find failing to find space for the page
tables/u-pages when forking. This is a "can't happen" case. :-)


13642 27-Jan-1996 bde

Added a `boundary' arg to vm_alloc_page_contig(). Previously the only
way to avoid crossing a 64K DMA boundary was to specify an alignment
greater than the size even when the alignment didn't matter, and for
sizes larger than a page, this reduced the chance of finding enough
contiguous pages. E.g., allocations of 8K not crossing a 64K boundary
previously had to be allocated on 8K boundaries; now they can be
allocated on any 4K boundary except (64 * n + 60)K.

Fixed bugs in vm_alloc_page_contig():
- the last page wasn't allocated for sizes smaller than a page.
- failures of kmem_alloc_pageable() weren't handled.

Mutated vm_page_alloc_contig() to create a more convenient interface
named contigmalloc(). This is the same as the one in 1.1.5 except
it has `low' and `high' args, and the `alignment' and `boundary'
args are multipliers instead of masks.
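
The 8K example in this entry can be checked with a small boundary test.
The helper below is hypothetical and is not the kernel's contigmalloc()
code; it assumes the boundary is a power of two and the size is nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_KB(n) ((uint32_t)(n) * 1024u)

/*
 * Return true if [start, start + size) spans a multiple of `boundary`.
 * Two addresses lie in the same boundary-sized window exactly when they
 * agree in all bits above the boundary's low-order bits.
 */
bool
ex_crosses_boundary(uint32_t start, uint32_t size, uint32_t boundary)
{
	uint32_t last;

	assert(size > 0);
	assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
	last = start + size - 1;
	return ((start ^ last) & ~(boundary - 1)) != 0;
}

/* Count the 4K-aligned starts in one 64K window where an 8K block crosses. */
int
ex_count_crossing_starts(void)
{
	int n = 0;

	for (uint32_t k = 0; k < 64; k += 4)
		if (ex_crosses_boundary(EX_KB(k), EX_KB(8), EX_KB(64)))
			n++;
	return n;
}
```

Of the sixteen 4K-aligned starting offsets in a 64K window, only the one
at 60K makes an 8K block cross the boundary, which matches the
"(64 * n + 60)K" exception stated above.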


13628 25-Jan-1996 phk

Don't use %r, we haven't got it anymore.
Submitted by: bde


13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.
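
The change from page queue flags to page queue indices listed above can
be illustrated with a toy model. Everything here is an assumption made
for illustration (the EX_ names, the numeric values, and the per-queue
counters); it is not the layout used in vm_page.c. The point is that a
stored index selects the queue directly, where flag bits would each need
to be tested in turn.

```c
#include <stddef.h>

/* Hypothetical queue indices; the real PQ_* values may differ. */
enum {
	EX_PQ_NONE,
	EX_PQ_FREE,
	EX_PQ_INACTIVE,
	EX_PQ_ACTIVE,
	EX_PQ_CACHE,
	EX_PQ_COUNT
};

struct ex_page {
	int queue;	/* index of the queue this page is on */
};

static size_t ex_queue_len[EX_PQ_COUNT];

void
ex_page_enqueue(struct ex_page *p, int queue)
{
	p->queue = queue;
	ex_queue_len[queue]++;
}

/* Remove a page from whatever queue it is on with one table lookup. */
void
ex_page_unqueue(struct ex_page *p)
{
	if (p->queue != EX_PQ_NONE) {
		ex_queue_len[p->queue]--;
		p->queue = EX_PQ_NONE;
	}
}

size_t
ex_queue_length(int queue)
{
	return ex_queue_len[queue];
}
```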


13228 04-Jan-1996 wollman

Convert DDB to new-style option.


13226 04-Jan-1996 wollman

Convert SYSV IPC to new-style options. (I hope I got everything...)
The LKMs will need an extra file, to come later.


13223 04-Jan-1996 dg

Increased vm_object_cache_max by about 50% to yield better utilization of
memory when lots of small files are cached.

Reviewed by: dyson


13122 30-Dec-1995 peter

recording cvs-1.6 file death


12954 21-Dec-1995 julian

i386/i386/conf.c is no longer needed.. remove it from files.i386
redistribute a few last routines to better places and shoot the file

I haven't actually 'deleted' the file yet, to give people time to
have done a config.. I.e. they are likely to have done one in a week or so
so I'll remove it then..
it's now empty.
makes the question of a USL copyright rather moot.


12914 17-Dec-1995 dyson

Fix paging from ext2fs (and other fs w/block size < PAGE_SIZE). This
should fix kern/900.


12905 17-Dec-1995 bde

Cleaned up prototypes in pmap headers: removed ones for nonexistent
functions; moved misplaced ones; restored most of KNFish formatting
from 4.4lite version; removed bogus __BEGIN/END_DECLS.


12904 17-Dec-1995 bde

Fixed 1TB filesize changes. Some pindexes had bogus names and types
but worked because vm_pindex_t is indistinguishable from vm_offset_t.


12820 14-Dec-1995 phk

Another mega commit to staticize things.


12819 14-Dec-1995 phk

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


12813 13-Dec-1995 julian

devsw tables are now arrays of POINTERS to struct [cb]devsw
seems to work here just fine though I can't check every file
that changed due to limited h/w, however I've checked enough to be pretty
happy with the code..

WARNING... struct lkm[mumble] has changed
so it might be an idea to recompile any lkm related programs


12808 13-Dec-1995 dyson

There was a bug that the size for an msync'ed region was not rounded
up. The effect of this was that msync with a size would generally sync
1 page less than it should. This problem was brought to my attention
by Darrel Herbst <dherbst@gradin.cis.upenn.edu> and Ron Minnich
<rminnich@sarnoff.com>.
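
The effect of the missing round-up can be shown with two page-count
calculations. This is a sketch with assumed names and an assumed 4K page
size, not the msync code itself.

```c
#include <stddef.h>

#define EX_PAGE_SIZE 4096u	/* assumed page size */

/* Page count when the size is truncated (the behavior described as buggy). */
size_t
ex_pages_truncated(size_t size)
{
	return size / EX_PAGE_SIZE;
}

/* Page count when the size is rounded up to a whole page. */
size_t
ex_pages_rounded_up(size_t size)
{
	return (size + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;
}
```

For a 4097-byte region the truncated count is 1 page and the rounded
count is 2, so the last partial page would be left unsynced. The two
agree only when the size is already a multiple of the page size.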


12779 11-Dec-1995 dyson

Some new anti-deadlock code ended up messing up the paging stats. A modified
version of the code is now in place, and gausspage performance is back
up to where it should be.


12778 11-Dec-1995 dyson

Some DIAGNOSTIC code was enabled all of the time in error. The
diagnostic code is now conditional on #ifdef DIAGNOSTIC again.


12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
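
Why naming pages by index raises the file size limit can be seen from the
arithmetic. The sketch below assumes 4K pages and a 32-bit index type;
the names are hypothetical and stand in for the kernel's own types.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT 12	/* assumed: 4K pages */

/* Convert a 64-bit byte offset within an object to a 32-bit page index. */
uint32_t
ex_offset_to_pindex(uint64_t offset)
{
	return (uint32_t)(offset >> EX_PAGE_SHIFT);
}

/* Convert a page index back to the byte offset of the page's first byte. */
uint64_t
ex_pindex_to_offset(uint32_t pindex)
{
	return (uint64_t)pindex << EX_PAGE_SHIFT;
}
```

Under these assumptions a 32-bit byte offset can only reach 4GB, while a
32-bit page index reaches 2^32 pages of 4K each, which comfortably covers
an offset of 1TB (page index 2^28).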


12737 10-Dec-1995 bde

Replaced nxdump by nodump (if the dump function gets called, then the
device must be configured, so ENXIO is a bogus errno).

Replaced zerosize by nopsize. zerosize was a temporary alias.


12726 10-Dec-1995 bde

Restored used includes of <vm/vm_extern.h>.


12710 10-Dec-1995 bde

Moved the declaration of boolean_t from <vm/vm_param.h> to
<sys/types.h> (if KERNEL is defined). This allows removing bogus
dependencies on vm stuff in several places (e.g., ddb) and stops
<vm_param.h> from depending on <vm_param.h>

Added declaration of boolean_t to <vm/vm.h> (if KERNEL is not
defined). It never belonged in <vm/vm_param.h>. Unfortunately,
it is required for some vm headers that are included by applications.

Deleted declarations of TRUE and FALSE from <vm/vm_param.h>. They
are defined in <sys/param.h> if KERNEL is defined and we'll soon
find out if any applications depend on them being defined in a vm
header.


12678 08-Dec-1995 phk

Julian forgot to make the *devsw structures static.


12675 08-Dec-1995 julian

Pass 3 of the great devsw changes
most devsw referenced functions are now static, as they are
in the same file as their devsw structure. I've also added DEVFS
support for nearly every device in the system, however
many of the devices have 'incorrect' names under DEVFS
because I couldn't quickly work out the correct naming conventions.
(but devfs won't be coming on line for a month or so anyhow so that doesn't
matter)

If you "OWN" a device which would normally have an entry in /dev
then search for the devfs_add_devsw() entries and munge to make them right..
check out similar devices to see what I might have done in them if you
can't see what's going on..
for a laugh compare conf.c conf.h before and after... :)
I have not done DEVFS entries for any DISKSLICE devices yet as that will be
a much more complicated job.. (pass 5 :)

pass 4 will be to make the devsw tables of type (cdevsw * )
rather than (cdevsw)
seems to work here..
complaints to the usual places.. :)


12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


12642 05-Dec-1995 bde

Moved the declaration of vm_object_t from <vm/vm.h> to <sys/types.h>
(if KERNEL is defined). This allows removing the #includes of vm
stuff in vnode_if.h, which will speed up the compilation of LINT by
about 5%.


12623 04-Dec-1995 phk

A major sweep over the sysctl stuff.

Move a lot of variables home to their own code (In good time before xmas :-)

Introduce the string description of format.

Add a couple more functions to poke into these marvels, while I try to
decide what the correct interface should look like.

Next is adding vars on the fly, and sysctl looking at them too.

Removed a tiny bit of defunct and #ifdefed notused code in swapgeneric.


12610 03-Dec-1995 bde

Fixed the type mismatch in check for the bogus mmap function `nullop'.
The test should never succeed and should go away. Temporarily print
a warning if it does succeed.


12591 03-Dec-1995 bde

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


12569 02-Dec-1995 bde

Finished (?) cleaning up sysinit stuff.


12521 29-Nov-1995 julian

If you're going to mechanically replicate something in 50 files
it's best to not have a (compiles cleanly) typo in it! (sigh)


12517 29-Nov-1995 julian

OK, that's it..
That's EVERY SINGLE driver that has an entry in conf.c..
my next trick will be to define cdevsw[] and bdevsw[]
as empty arrays and remove all those DAMNED defines as well..

Each of these drivers has a SYSINIT linker set entry
that comes in very early.. and asks the driver to add its own
entry to the two devsw[] tables.

some slight reworking of the commits from yesterday (added the SYSINIT
stuff and some usually wrong but token DEVFS entries to all these
devices).

BTW does anyone know where the 'ata' entries in conf.c actually reside?
seems we don't actually have an 'ataopen()' etc...

If you want to add a new device in conf.c
please make sure I know
so I can keep it up to date too..

as before, this is all dependent on #if defined(JREMOD)
(and #ifdef DEVFS in parts)


12453 21-Nov-1995 bde

Completed function declarations and/or added prototypes.


12423 20-Nov-1995 phk

Remove unused vars & funcs, make things static, protoize a little bit.


12325 16-Nov-1995 bde

Fixed recent staticizations. Some prototypes for static functions were
left in headers and not staticized.


12300 14-Nov-1995 phk

staticize.


12286 14-Nov-1995 phk

Move all the VM sysctl stuff home where it belongs.


12259 13-Nov-1995 dg

Fixed up a comment and removed some #if 0'd code.


12226 12-Nov-1995 dg

Moved vm_map_lock call to inside the splhigh protection in vm_map_find().
This closes a probably rare but nonetheless real window that would result
in a process hanging or the system panicking.

Reviewed by: dyson, davidg
Submitted by: kato@eclogite.eps.nagoya-u.ac.jp (KATO Takenori)


12221 12-Nov-1995 bde

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


12206 11-Nov-1995 bde

Fixed type of obreak(). The args struct member name conflicted with
the (better) machine generated one in <sys/sysproto.h>.


12128 06-Nov-1995 dg

Initialize lock struct entries explicitly rather than calling bzero().


12118 06-Nov-1995 bde

Replaced bogus macros for dummy devswitch entries by functions.
These functions went away:

enosys (hasn't been used for some time)
enxio
enodev
enoioctl (was used only once, actually for a vop)

if_tun.c:
Continued cleaning up...

conf.h:
Probably fixed the type of d_reset_t. It is hard to tell the correct
type because there are no non-dummy device reset functions.

Removed last vestige of ambiguous sleep message strings.


12110 05-Nov-1995 dyson

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


12006 02-Nov-1995 dg

Move page fixups (pmap_clear_modify, etc) that happen after paging input
completes out of vm_fault and into the pagers. This gets rid of some
redundancy and improves the architecture.

Reviewed by: John Dyson <dyson>


11943 30-Oct-1995 bde

Don't pass an extra trailing arg to some functions.

Added the prototypes that found this bug.


11709 23-Oct-1995 dyson

Get rid of machine-dependent NBPG and replace with PAGE_SIZE.


11708 23-Oct-1995 dyson

Remove the now unused PG_COPYONWRITE.


11705 23-Oct-1995 dyson

First phase of removing the PG_COPYONWRITE flag, and an architectural
cleanup of mapping files.


11701 23-Oct-1995 dyson

Finalize GETPAGES layering scheme. Move the device GETPAGES
interface into specfs code. No need at this point to modify the
PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.


11621 21-Oct-1995 dyson

Implement mincore system call.


11576 19-Oct-1995 dg

Fix initialization of "bsize" in vnode_pager_haspage(). It must happen
after the check for the mount point still existing or else the system
will panic if someone forcibly unmounted the filesystem.


11526 16-Oct-1995 dyson

Remove an unnecessary tsleep in the swapin code. This tsleep
can defer swapping in processes and is just not the right thing to do.


11317 07-Oct-1995 dg

Fix argument passing to the "freeer" routine. Added some prototypes. (bde)
Moved extern declaration of swap_pager_full into swap_pager.h and out of
the various files that reference it. (davidg)

Submitted by: bde & davidg


11260 06-Oct-1995 phk

Avoid a 64bit divide.


11194 05-Oct-1995 bde

Fix pollution of application namespace by declarations of kernel
functions. The application header <sys/user.h> includes <vm/vm.h>
which includes <vm/lock.h>...

vm.h:
Don't include <machine/cpufunc.h>. It is already included by
<sys/systm.h> in the kernel and isn't designed to be included by
applications (the 2.1 version causes a syntax error in C++ and the
current version has initializers that are invalid in strict C++).

lock.h:
Only declare kernel functions if KERNEL is defined.


10989 24-Sep-1995 dyson

Perform more checking for proper loading of the UPAGES when a process
is swapped in. Also, remove unnecessary map locking/unlocking during
selection of processes to be swapped out.

This code might afford proper panics as opposed to spontaneous reboots
on certain systems. This should allow us to debug these problems better.


10988 24-Sep-1995 dyson

Significantly simplify the fault clustering code. After some analysis by
David Greenman, it has been determined that the more sophisticated code
only made a very minor difference in fault performance. Therefore, this
code eliminates some of the complication of the fault code, decreasing
the amount of CPU used to scan shadow chains.


10984 24-Sep-1995 dg

Check that the swap block is valid before including it in a cluster.

Submitted by: John Dyson


10835 17-Sep-1995 dg

Check the return value from vm_map_pageable() when mapping the process's
UPAGES and associated page table page. Panic on error. This is less than
optimal and will be fixed in the future, but is better than the old
behavior of panicking with a "kernel page directory invalid" in pmap_enter.


10728 14-Sep-1995 dyson

Fixed a typo in vm_fault_additional_pages.


10702 12-Sep-1995 dyson

Fix really bogus casting of a block number to a long. Also change the
comparison from a "< 0" to "== -1" like it should be.


10670 11-Sep-1995 dyson

Make sure that the prezero flag is cleared when needed.


10669 11-Sep-1995 dyson

Fix an error that can cause attempted reading beyond the end of file.


10668 11-Sep-1995 dyson

Code cleanup and minor performance improvement in the faultin cluster
code.


10653 09-Sep-1995 dg

Fixed init functions argument type - caddr_t -> void *. Fixed a couple of
compiler warnings.


10579 06-Sep-1995 dyson

Fixed a sign reversal problem -- might have caused some Sig-11s that
people have been seeing.


10576 06-Sep-1995 dyson

Minor performance improvements, additional prototype for additional
exported symbol.


10556 04-Sep-1995 dyson

Allow the fault code to use additional clustering info from both
bmap and the swap pager. Improved fault clustering performance.


10551 04-Sep-1995 dyson

Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count
for VOP_BMAP. Updated affected filesystems...


10548 03-Sep-1995 dyson

Machine independent changes to support pre-zeroed free pages. This
significantly improves demand-zero performance.


10544 03-Sep-1995 dyson

Added prototype for new routine "vm_page_set_validclean" and initial
declarations for the prezeroed pages mechanism.


10542 03-Sep-1995 dyson

New subroutine "vm_page_set_validclean" for a vfs_bio improvement.


10358 28-Aug-1995 julian

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadable modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

Mounting root off disk certainly still works just fine.
mfs should work but is untested (tomorrow's task).

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


10345 26-Aug-1995 bde

Change vm_object_print() to have the correct number and type of args
for a ddb command.


10344 26-Aug-1995 bde

Change vm_map_print() to have the correct number and type of args for
a ddb command.


10080 16-Aug-1995 bde

Make everything except the unsupported network sources compile cleanly
with -Wnested-externs.


9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


9582 20-Jul-1995 dg

#if 0'd one of the DIAGNOSTIC checks in vm_page_alloc(). It was too
expensive for "normal" use.


9548 16-Jul-1995 dg

1) Merged swpager structure into vm_object.
2) Changed swap_pager internal interfaces to cope w/#1.
3) Eliminated object->copy as we no longer have copy objects.
4) Minor stylistic changes.


9514 13-Jul-1995 dg

Added a copyright to this file.


9513 13-Jul-1995 dg

Oops, forgot to add the "default" pager files...

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Its marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Its marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


9468 10-Jul-1995 dg

swapout_threads() -> swapout_procs().


9467 10-Jul-1995 dg

Increased global RSS limit to total RAM.


9456 09-Jul-1995 dg

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


9411 06-Jul-1995 dg

Fixed an object allocation race condition that was causing a "object
deallocated too many times" panic when using NFS.

Reviewed by: John Dyson


9356 28-Jun-1995 dg

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


9202 11-Jun-1995 rgrimes

Merge RELENG_2_0_5 into HEAD


8876 30-May-1995 rgrimes

Remove trailing whitespace.


8743 25-May-1995 dg

Removed check for sw_dev == NODEV; this is a normal condition for swap
over NFS and was gratuitously panicking when it happened.

Reviewed by: John Dyson
Submitted by: Pierre Beyssac via Poul-Henning Kamp


8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this led to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmapped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


8624 19-May-1995 dg

NFS diskless operation was broken because swapdev_vp wasn't initialized.
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().

Submitted by: Poul-Henning Kamp


8588 18-May-1995 dg

Fixed a bug that managed to slip in during Poul's dynamic swap partition
changes. The check for nswap was bogus, but the code was so convoluted
that it was difficult to tell. It's better now. :-)

Reviewed by: David Greenman (extensively), and John Dyson
Submitted by: Poul-Henning Kamp, w/tweaks by me.


8585 18-May-1995 dg

Accessing pages beyond the end of a mapped file results in internal
inconsistencies in the VM system that eventually lead to a panic. These
changes fix the behavior to conform to the behavior in SunOS, which is
to deny faults to pages beyond the EOF (returning SIGBUS). Internally,
this is implemented by requiring faults to be within the object size
boundaries. These changes exposed another bug, namely that passing in
an offset to mmap when trying to map an unnamed anonymous region also
results in internal inconsistencies. In this case, the offset is forced
to zero.

Reviewed by: John Dyson and others


8504 14-May-1995 dg

Changed swap partition handling/allocation so that it doesn't
require specific partitions be mentioned in the kernel config
file ("swap on foo" is now obsolete).

From Poul-Henning:

The visible effect is this:

As default, unless
options "NSWAPDEV=23"
is in your config, you will have four swap-devices.
You can swapon(2) any block device you feel like, it doesn't have
to be in the kernel config.

There is a performance/resource win available by getting the NSWAPDEV right
(but only if you have just one swap-device ??), but using that as default
would be too restrictive.

The invisible effect is that:

Swap-handling disappears from the $arch part of the kernel.
It gets a lot simpler (-145 lines) and cleaner.

Reviewed by: John Dyson, David Greenman
Submitted by: Poul-Henning Kamp, with minor changes by me.


8464 12-May-1995 phk

I'm about to jump on the swap-initialization, and having talked
with davidg about it, I hereby kill two undocumented misfeatures:
The code to skip a miniroot in the swapdev is not particularly useful, and
if we need it we need it to be done properly, i.e. size the fs and skip all
of it not some hardcoded size, and subtract what we skip from the length
in the first place.
The SEQSWAP dies too. It's not the way to do it, it doesn't work, and
nobody has expressed any great desire for it to work. The way to
implement it correctly would be a second argument to swapon(2) to give
a priority/policy information. Low priority swapdevs can be made so
by adding them at a far offset (0x80000000 kind of thing), with almost no
modification to the strategy routine (in particular an offset per swapdev).
But until the need is obvious, it will not be done.


8416 10-May-1995 dg

Changed "handle" from type caddr_t to void *; "handle" is several different
types of pointers, and "char *" is a bad choice for the type.


8319 07-May-1995 dyson

Another error in the correction for trimming swap allocation for
small objects. (This code needs to be revisited.)


8315 07-May-1995 dyson

Fixed a calculation that would once-in-a-while cause the swap_pager
to emit spurious page outside of object type messages. It is not
a fatal condition anyway, so the message will be omitted for
release. Also, the code that "clips" the allocation size, associated
with the above problem, was fixed.


8216 02-May-1995 dg

Changed object hash list to be a list rather than a tailq. This saves
space for the hash list buckets and is a little faster. The features
of tailq aren't needed. Increased the size of the object hash table
to improve performance. In the future, this will be changed so that
the table is sized dynamically.


8059 25-Apr-1995 dg

Fixed a "bswbuf" hang caused by the wakeup in relpbuf() waking up the
wrong thing.


8010 23-Apr-1995 bde

inline -> __inline.

Headers should always use `__inline' for inline functions to avoid
syntax errors when modules that don't even use the offending functions
are compiled with `gcc -ansi'.


7968 21-Apr-1995 dyson

Fixed a problem in _vm_object_page_clean that could cause an
infinite loop.


7935 19-Apr-1995 dg

New flag: B_PAGING. Added as part of the vn driver hack.


7904 17-Apr-1995 dg

Fixed a logic bug that caused the vmdaemon to not wake up when intended.

Submitted by: John Dyson


7888 16-Apr-1995 dg

Removed obsolete/unused variable declarations. Killed externs and included
appropriate include files.


7887 16-Apr-1995 dg

Removed obsolete/unused variable declarations.
Removed some extern declarations and included the correct include files.


7883 16-Apr-1995 dg

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


7879 16-Apr-1995 dg

Removed gratuitous m->blah=0 assignments when initializing the vm_page
structs in vm_page_startup(). The vm_page structs are already completely
zeroed.


7873 16-Apr-1995 dg

Make "print_page_info" #ifdef DDB.


7870 16-Apr-1995 dg

Fixed a few bugs in vm_object_page_clean, mostly related to not syncing
pages that are in FS buffers. This fixes the (believed to already have been
fixed) problem with msync() not doing its job...in other words, the
stuff that Andrew has continuously been complaining about.

Submitted by: John Dyson, w/minor changes by me.


7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


7430 28-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.


7427 28-Mar-1995 dg

Fixed typo...using wrong variable in page_shortage calculation.


7424 28-Mar-1995 dg

Fixed "pages freed by daemon" statistic (again).


7411 27-Mar-1995 dg

Explicitly set page dirty if this is a write fault - reduces calls to
pmap_is_modified() later.


7400 26-Mar-1995 dg

Removed some obsolete flags.

Submitted by: John Dyson


7366 25-Mar-1995 dg

Fix logic bug I just introduced with the flags to msync().


7365 25-Mar-1995 dg

Pass syncio flag to vm_object_clean(). It remains unimplemented, however.


7364 25-Mar-1995 dg

Disallow both MS_ASYNC and MS_INVALIDATE flags being set at the same time
in msync().
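
A minimal sketch of a mutual-exclusion check of this kind (not the actual
msync() code; the flag values, the function name, and the choice of EINVAL
as the error are assumptions made for illustration):

```c
#include <errno.h>

/* Hypothetical flag values; real ones come from <sys/mman.h>. */
#define SK_MS_ASYNC      0x1
#define SK_MS_INVALIDATE 0x2

/* Reject a request that sets both flags at once; accept anything else. */
static int check_msync_flags(int flags)
{
    const int both = SK_MS_ASYNC | SK_MS_INVALIDATE;

    if ((flags & both) == both)
        return EINVAL;   /* assumed error code */
    return 0;
}
```

Either flag alone passes the check; only the combination is refused.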


7360 25-Mar-1995 dg

Added "flags" argument to msync, and implemented MS_ASYNC and MS_INVALIDATE.
The MS_ASYNC flag doesn't currently work, and MS_INVALIDATE will only toss out
the pages in the address space (not all pages in the shadow chain).


7352 25-Mar-1995 dg

Implemented cnt.v_reactivated and moved vm_page_activate() routine to
before vm_page_deactivate().


7350 25-Mar-1995 dg

Removed (almost) meaningless "object cache lookups/hits" statistic. In
our framework, these numbers will usually be nearly the same, and not
because of any sort of high 'hit rate'.


7346 25-Mar-1995 dg

Removed cnt.v_nzfod: In our current scheme of things it is not possible
to accurately track this. It isn't an indicator of resource consumption
anyway.
Removed cnt.v_kernel_pages: We don't implement this and doing so accurately
would be very difficult (and ambiguous - since process pages are often
double mapped in the kernel and the process address spaces).


7263 23-Mar-1995 dg

Fixed warning caused by returning a value in a void function (introduced
in a recent commit by me). Relaxed checks before calling vm_object_remove;
a non-internal object always has a pager.


7246 22-Mar-1995 dg

Removed unused fifth argument to vm_object_page_clean(). Fixed bug with
VTEXT not always getting cleared when it is supposed to. Added check to
make sure that vm_object_remove() isn't called with a NULL pager or for
a pager for an OBJ_INTERNAL object (neither of which will be on the hash
list). Clear OBJ_CANPERSIST if we decide to terminate it because of no
resident pages.


7243 22-Mar-1995 dg

Fixed potential sleep/wakeup race condition with splhigh().

Submitted by: John Dyson


7240 22-Mar-1995 dg

Added a check for wrong object size; print a warning, but deal with it
correctly. The warning will tell us that there is a bug somewhere else
in sizing the object correctly.

Submitted by: John Dyson


7239 22-Mar-1995 dg

Fixed bug in vm_mmap() where the object that is created in some cases
was the wrong size. This is the likely cause of panics reported by
Lars Fredriksen and Paul Richards related to a -1 blkno when paging
via the swap_pager.

Submitted by: John Dyson


7236 21-Mar-1995 dg

Removed unused variable declaration missed in previous commit.


7235 21-Mar-1995 dg

Removed do-nothing VOP_UPDATE() call.


7215 21-Mar-1995 dg

Disallow non page-aligned file offsets in vm_mmap(). We don't support this
in either the high or low level parts of the VM system. Just return EINVAL
in this case, just like SunOS does.
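
The alignment test described here can be sketched as follows (a standalone
illustration, not the vm_mmap() source; the 4096-byte page size and the
names below are assumptions):

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u                    /* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1u)  /* low bits within a page */

/* Return 0 for a page-aligned file offset, EINVAL otherwise. */
static int check_mmap_offset(uint64_t offset)
{
    if ((offset & SKETCH_PAGE_MASK) != 0)
        return EINVAL;
    return 0;
}
```

Masking with the page size minus one works because the page size is a power
of two, so any nonzero low bits indicate an offset inside a page.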


7209 21-Mar-1995 dg

Fixed bug in the size == 0 case of msync() caused by a bogus return value
check.


7204 21-Mar-1995 dg

Added a new boolean argument to vm_object_page_clean that causes it to
only toss out clean pages if TRUE.


7187 20-Mar-1995 dg

Don't gain/lose an object reference in vnode_pager_setsize(). It will
cause vnode locking problems in vm_object_terminate().
Implement proper vnode locking in vm_object_terminate().


7185 20-Mar-1995 dg

Fixed "objde1" hang. It was caused by a "&" where an "&&" belonged in the
expression that decides if a wakeup should occur.
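
As an illustration of this class of bug (a standalone sketch, not the actual
kernel expression; the flag value and function names are hypothetical), a
bitwise "&" between a masked flag and a comparison can yield zero even when
both conditions hold:

```c
#include <stdbool.h>

/* Hypothetical flag bit, chosen so that it does not overlap the value 1. */
#define SK_WANT_WAKEUP 0x02u

/* Buggy form: (flags & SK_WANT_WAKEUP) is 0 or 2 and (pending == 0) is
 * 0 or 1, so their bitwise AND is always 0 and the wakeup never happens. */
static bool should_wakeup_buggy(unsigned flags, int pending)
{
    return ((flags & SK_WANT_WAKEUP) & (unsigned)(pending == 0)) != 0;
}

/* Fixed form: logical AND tests each condition for truth on its own. */
static bool should_wakeup_fixed(unsigned flags, int pending)
{
    return (flags & SK_WANT_WAKEUP) && (pending == 0);
}
```

With the flag set and nothing pending, the buggy form reports false and a
sleeper would wait forever, while the fixed form reports true.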


7180 20-Mar-1995 dg

Removed an unnecessary call to vinvalbuf after the page clean.


7178 19-Mar-1995 dg

Do proper vnode locking when doing paging I/O. Removed the asynchronous
paging capability to facilitate this (we saw little or no measurable
improvement with this anyway).

Submitted by: John Dyson


7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


7162 19-Mar-1995 dg

Incorporated 4.4-lite vnode_pager_uncache() and vnode_pager_umount()
routines (and merged local changes). The changed vnode_pager_uncache
gets rid of the bogosity that you can call the routine without
having the vnode locked. The changed vnode_pager_umount properly locks
the vnode before calling vnode_pager_uncache.


7120 18-Mar-1995 dg

In vm_page_alloc_contig: Removed a redundant semicolon and used 'm' instead
of &pga[i] in one place.


7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


7066 15-Mar-1995 dg

Special cased the handling of mb_map in the M_WAITOK case. kmem_malloc()
now returns NULL and sets a global 'mb_map_full' when the map is full.
m_clalloc() has further been taught to expect this and do the right thing.
This should fix the "mb_map full" panics that several people have reported.


7029 12-Mar-1995 bde

Move a kernel inline function inside `#ifdef KERNEL' so that including
<vm/vm.h> doesn't cause warnings about nonexistent functions called
by the inline function. Clean up the formatting of the function.


7017 12-Mar-1995 dg

Fixed obsolete comment.


7016 12-Mar-1995 dg

Deleted vm_object_setpager().


7015 12-Mar-1995 dg

Deleted vm_object_setpager().


7014 12-Mar-1995 dg

Explicitly set object->flags = OBJ_CANPERSIST.


7008 11-Mar-1995 dg

Fix completely bogus comment.


7007 11-Mar-1995 dg

Clear OBJ_INTERNAL flag for device pager objects and named anonymous
objects.


6947 07-Mar-1995 dg

Set VAGE flag when pager is destroyed. This usually happens when an
object has fallen off the end of the cached list - this is likely the
last reference to the vnode and it should be reused before non file
vnodes that are already on the free list (VDIR mostly).


6944 07-Mar-1995 dg

Fixed object reference count problem that occurred in the MAP_PRIVATE
case after we rewrote vm_mmap(). Added some comments to make it easier
to follow the reference counts.


6943 07-Mar-1995 dg

Don't attempt to reverse collapse non OBJ_INTERNAL objects.


6897 04-Mar-1995 jkh

Remove a gratuitous cast.


6816 01-Mar-1995 dg

Various changes from John and myself that do the following:

New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


6806 01-Mar-1995 dg

Slight change to include file order to accommodate upcoming changes.


6709 25-Feb-1995 bde

Don't use __P(()) in a function definition.


6703 25-Feb-1995 dg

Fixed severely broken printf (arguments out of order, no newline).


6673 23-Feb-1995 dg

Removed redundant HOLDRELE()'s.


6626 22-Feb-1995 dg

Changed return value from vnode_pager_addr to be in DEV_BSIZE units so
that 9 bits aren't lost in the conversion. Changed all callers to expect
this. This allows paging on large (>2GB) filesystems.
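
The arithmetic behind this can be sketched as follows (a standalone
illustration; the helper names are hypothetical, and a 32-bit signed return
value and a 512-byte DEV_BSIZE are assumed). A byte address stops fitting at
2GB, while the same address in DEV_BSIZE units needs 9 fewer bits:

```c
#include <stdint.h>

#define SK_DEV_BSHIFT 9   /* assumed: DEV_BSIZE == 512 == 1 << 9 */

/* Does a byte offset fit in a 32-bit signed value? */
static int fits_as_bytes(uint64_t byte_off)
{
    return byte_off <= (uint64_t)INT32_MAX;
}

/* Does the same offset fit once expressed in device blocks? */
static int fits_as_blocks(uint64_t byte_off)
{
    return (byte_off >> SK_DEV_BSHIFT) <= (uint64_t)INT32_MAX;
}

/* Convert a byte offset to a device block number; the caller should
 * first confirm the result fits by calling fits_as_blocks(). */
static int32_t byte_to_block(uint64_t byte_off)
{
    return (int32_t)(byte_off >> SK_DEV_BSHIFT);
}
```

For example, an offset of 3GB does not fit as a byte count but maps to block
number 6291456, which fits comfortably.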

Submitted by: John Dyson


6625 22-Feb-1995 dg

vm_page.c:
Use request==VM_ALLOC_NORMAL rather than object!=kmem_object in deciding
if the caller is "important" in vm_page_alloc(). Also established a new
low threshold for non-interrupt allocations via cnt.v_interrupt_free_min.

vm_pageout.c:
Various algorithmic cleanup. Some calculations simplified. Initialize
cnt.v_interrupt_free_min to 2 pages.

Submitted by: John Dyson


6624 22-Feb-1995 dg

Just return in the case of a page not on any queue in vm_page_unqueue().
Return VM_PAGE_BITS_ALL even if size > PAGE_SIZE in vm_page_bits().

Submitted by: John Dyson
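
The vm_page_bits() behaviour can be pictured with a small sketch. This is an illustration only, not the committed code: it assumes a 4096-byte page tracked as eight 512-byte chunks, and every name prefixed with SKETCH_ or sketch is invented.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_DEV_BSIZE 512
#define SKETCH_CHUNKS    (SKETCH_PAGE_SIZE / SKETCH_DEV_BSIZE)	/* 8 */
#define SKETCH_BITS_ALL  ((1 << SKETCH_CHUNKS) - 1)		/* 0xff */

/* Return one bit per DEV_BSIZE chunk covered by [base, base + size). */
static int
page_bits_sketch(int base, int size)
{
	int first, nchunks;

	/* A range covering the whole page, or more, maps to every bit. */
	if (base == 0 && size >= SKETCH_PAGE_SIZE)
		return (SKETCH_BITS_ALL);

	/* Round the length up to whole chunks and cap it at one page. */
	nchunks = (size + SKETCH_DEV_BSIZE - 1) / SKETCH_DEV_BSIZE;
	if (nchunks > SKETCH_CHUNKS)
		nchunks = SKETCH_CHUNKS;
	first = (base % SKETCH_PAGE_SIZE) / SKETCH_DEV_BSIZE;

	/* Bits that would run past the end of the page are masked off. */
	return ((((1 << nchunks) - 1) << first) & SKETCH_BITS_ALL);
}
```

With these assumptions, a size of 8192 at base 0 yields 0xff instead of an out-of-range mask, and a 1024-byte range starting at offset 512 yields 0x06.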


6623 22-Feb-1995 dg

Removed object locking code (it was a left over from an abortion that
was done a month or so ago).

Submitted by: John Dyson


6622 22-Feb-1995 dg

Removed bogus copy object collapse check (the idea is right, but the
specific check was bogus).
Removed old copy of vm_object_page_clean and took out the #if 1 around
the remaining one.

Submitted by: John Dyson


6618 22-Feb-1995 dg

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson
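
The idea of skipping wakeups when nobody is waiting can be sketched as follows. The structure, the flag name SKETCH_OBJ_PIPWNT, and the counting stub that replaces the real wakeup call are all assumptions made for illustration.

```c
#include <assert.h>

#define SKETCH_OBJ_PIPWNT 0x1	/* hypothetical "someone is waiting" flag */

struct sketch_object {
	int paging_in_progress;
	int flags;
};

static int sketch_wakeups;	/* how many times the stub wakeup ran */

/* Stub standing in for the kernel wakeup call. */
static void
sketch_wakeup(struct sketch_object *obj)
{
	(void)obj;
	sketch_wakeups++;
}

/* Finish one paging operation; wake waiters only if one registered. */
static void
sketch_pip_done(struct sketch_object *obj)
{
	assert(obj->paging_in_progress > 0);
	obj->paging_in_progress--;
	if (obj->paging_in_progress == 0 &&
	    (obj->flags & SKETCH_OBJ_PIPWNT) != 0) {
		obj->flags &= ~SKETCH_OBJ_PIPWNT;
		sketch_wakeup(obj);
	}
}
```

A sleeper would set the flag before blocking, so the common case of no waiters costs only a flag test.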


6617 22-Feb-1995 dg

Rewrote MAP_PRIVATE case of vm_mmap() - all of the COW portion of this
routine was highly convoluted.

Submitted by: John Dyson


6601 21-Feb-1995 dg

Panic if u_map allocation fails.


6587 21-Feb-1995 dg

vm_extern.h: removed vm_allocate_with_pager.
Removed vm_user.c...it's now completely deprecated.


6585 21-Feb-1995 dg

Deprecated remaining use of vm_deallocate. Deprecated vm_allocate_with_
pager(). Almost completely rewrote vm_mmap(); when John gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


6584 21-Feb-1995 dg

Set page alloced for map entries as valid.


6582 20-Feb-1995 dg

Removed vm_allocate(), vm_deallocate(), and vm_protect() functions. The
only function remaining in this file is vm_allocate_with_pager(), and this
will be going RSN. The file will be removed when this happens.


6580 20-Feb-1995 dg

Moved ACT_MAX, ACT_ADVANCE, and ACT_DECLINE to vm_page.h.


6573 20-Feb-1995 dg

vm_inherit function has been deprecated.


6572 20-Feb-1995 dg

Stop using vm_allocate and vm_deallocate.


6571 20-Feb-1995 dg

VM for the kernel stack and page tables doesn't need to be explicitly
deallocated as it isn't inherited across the fork.
Use vm_map_find not vm_allocate.

Submitted by: John Dyson


6567 20-Feb-1995 dg

Panic if object is deallocated too many times.
Slight change to reverse collapsing so that vm_object_deallocate doesn't
have to be called recursively.
Removed half of a previous fix - the renamed page during a collapse doesn't
need to be marked dirty because the pager backing store pointers are copied
- thus preserving the page's data. This assumes that pages without backing
store are always dirty (except perhaps for when they are first zeroed, but
this doesn't matter).
Switch order of two lines of code so that the correct pager is removed
from the hash list. The previous code bogusly passed a NULL pointer to
vm_object_remove(). The call to vm_object_remove() should be unnecessary
if named anonymous objects were being dealt with correctly. They are
currently marked as OBJ_INTERNAL, which really screws up things (such as
this).


6566 20-Feb-1995 dg

Don't allow act_count to exceed ACT_MAX when bumping it up.
Small optimization to vm_page_bits().

Submitted by: John Dyson
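
The act_count change amounts to a saturating increment. A minimal sketch follows; the values chosen for ACT_ADVANCE and ACT_MAX are assumptions for illustration, and the function name is invented.

```c
#include <assert.h>

#define SKETCH_ACT_ADVANCE 3	/* assumed step size */
#define SKETCH_ACT_MAX     100	/* assumed ceiling */

/* Bump the activity count, never letting it pass the ceiling. */
static int
sketch_act_bump(int act_count)
{
	act_count += SKETCH_ACT_ADVANCE;
	if (act_count > SKETCH_ACT_MAX)
		act_count = SKETCH_ACT_MAX;
	return (act_count);
}
```

Without the clamp, a frequently referenced page could accumulate a count that takes arbitrarily long to decay.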


6565 20-Feb-1995 dg

Fully initialize pages returned via vm_page_alloc_contig() so that the
memory can be later freed.


6541 18-Feb-1995 dg

1) Added protection against collapsing OBJ_DEAD objects.
2) bump reference counts by 2 instead of 1 so that an object deallocate
doesn't try to recursively collapse the object.
3) mark pages renamed during the collapse as dirty so that their contents
are preserved.

Submitted by: John and me.


6435 15-Feb-1995 dg

Don't bother calling pmap_create() when creating the temporary map.
The whole COW section of vm_mmap() should be rewritten; the current
implementation is highly convoluted.


6357 14-Feb-1995 phk

YF fix.


6356 14-Feb-1995 phk

YF Fix.


6351 14-Feb-1995 dg

Fixed problem with msync causing a panic.

Submitted by: John Dyson


6326 12-Feb-1995 dg

Carefully choose the value for vm_object_cache_max. The previous calculation
was rather bogus in most cases; the new value works very well for both
large and small memory machines.


6278 09-Feb-1995 dg

Killed MACHVMCOMPAT function prototypes as the functions don't exist.


6277 09-Feb-1995 dg

Killed MACHVMCOMPAT code. It doesn't compile, and in its present state
would require some work to make it not a serious security problem. It's
non-standard and not very useful anyway.


6258 09-Feb-1995 dg

Minor algorithmic adjustments that reduce the CPU consumption of the
pagedaemon in half while not reducing its effectiveness.

Submitted by: me & John


6151 03-Feb-1995 dg

Fixed bmap run-length brokenness.
Use bmap run-length extension when doing clustered paging.

Submitted by: John Dyson


6129 02-Feb-1995 dg

swap_pager.c:
Fixed long standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just free_reserved
level.

Submitted by: John Dyson


5973 28-Jan-1995 dg

Completed the fix for attempting to page out pages via the device_pager.

Submitted by: John Dyson


5915 26-Jan-1995 dg

Use the VM_PAGE_BITS_ALL in a place it can be used.
Comment out call to pmap_prefault() until stability problems can be
thoroughly analyzed.


5903 25-Jan-1995 dg

Don't attempt to clean device_pager backed objects at terminate time.
There is similar bogusness in the pageout daemon that will be fixed soon.
This fixes a panic pointed out to me by Bruce Evans that occurs when
/dev/mem is used to map managed memory.


5841 24-Jan-1995 dg

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


5636 15-Jan-1995 dg

Moved some splx's down a few lines in vm_page_insert and vm_page_remove
to make the locking a bit more clear - this change is currently a NOP
as the calls to those routines are already at splhigh().


5571 13-Jan-1995 dg

Protect a qcollapse call with an object lock before calling. The locks
need to be moved into the qcollapse and rcollapse routines, but I don't
have time at the moment to make all the required changes...this will do
for now.


5520 11-Jan-1995 dg

Improve my previous change to use the same tests as are used in qcollapse.


5519 11-Jan-1995 dg

Fixed a panic that Garrett reported to me...the OBJ_INTERNAL flag wasn't
being cleared in some cases for vnode backed objects; we now do this in
vnode_pager_alloc proper to guarantee it. Also be more careful in the
rcollapse code about messing with busy/bmapped pages.


5465 10-Jan-1995 dg

Kill VM_PAGE_INIT macro as it is only used once and makes the code more
difficult to understand. Got rid of unused vm_page flags.


5464 10-Jan-1995 dg

Fixed some formatting weirdness that I overlooked in the previous commit.


5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


5404 05-Jan-1995 dg

Make sure that the object being collapsed doesn't go away on us...by
gaining extra references to it.

Submitted by: John Dyson


5348 02-Jan-1995 ats

Submitted by: Ben Jackson
just a missing newline in a kernel printf added.


5283 30-Dec-1994 bde

Clean up previous commits (format for 80 columns...).


5203 23-Dec-1994 dg

Do vm_page_rename more conservatively in rcollapse and qcollapse, and
change list walk so that it doesn't get stuck in an infinite loop.

Submitted by: John Dyson


5202 23-Dec-1994 dg

Initialize b_vnbuf.le_next before returning a new buffer in getpbuf and
trypbuf. Move a couple of splbio's to be slightly less conservative.


5186 22-Dec-1994 dg

Fixed a benign off by one error.


5166 19-Dec-1994 dg

Don't ever clear B_BUSY on a pbuf (or any other flag for that matter).
This appears to be the cause of some buffer confusion that leads to
a panic during heavy paging.

Submitted by: John Dyson


5151 18-Dec-1994 dg

Fixed multiple bogons with the map entry handling.


5146 18-Dec-1994 dg

Fixed bug where statically allocated map entries might be freed to the
malloc pool...causing a panic.

Submitted by: John Dyson
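
One way to guard against this class of bug is to test whether an entry lives in the static pool before handing it to the allocator. The sketch below is an assumption about the shape of such a check, not the committed fix; the pool, its size, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_NSTATIC 8

struct sketch_entry {
	int start;
	int end;
};

/* Entries set aside at boot; these must never reach free(). */
static struct sketch_entry sketch_static_pool[SKETCH_NSTATIC];

/* Does the entry lie inside the static pool? Compared as integers. */
static int
sketch_is_static(const struct sketch_entry *e)
{
	uintptr_t p = (uintptr_t)e;
	uintptr_t lo = (uintptr_t)&sketch_static_pool[0];
	uintptr_t hi = lo + sizeof(sketch_static_pool);

	return (p >= lo && p < hi);
}

/* Release an entry; *freed_to_heap reports which path was taken. */
static void
sketch_entry_free(struct sketch_entry *e, int *freed_to_heap)
{
	if (sketch_is_static(e)) {
		*freed_to_heap = 0;	/* would go back on a static free list */
		return;
	}
	free(e);
	*freed_to_heap = 1;
}
```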


5145 18-Dec-1994 dg

Change swapping policy to be a bit more aggressive about finding a
candidate for swapout. Increased default RSS limit to a minimum of 2MB.


5114 15-Dec-1994 dg

Protect kmem_map modifications with splhigh() to work around a problem with
the map being locked at interrupt time.


5033 11-Dec-1994 dg

Don't put objects that have no parent on the reverse_shadow_list. Problem
identified and explained by Gene Stark (thanks Gene!).

Submitted by: John Dyson


4810 25-Nov-1994 dg

These changes fix a couple of lingering VM problems:

1. The pageout daemon used to block under certain
circumstances, and we needed to add new functionality
that would cause the pageout daemon to block more often.
Now, the pageout daemon mostly just gets rid of pages
and kills processes when the system is out of swap.
The swapping, rss limiting and object cache trimming
have been folded into a new daemon called "vmdaemon".
This new daemon does things that need to be done for
the VM system, but can block. For example, if the
vmdaemon blocks for memory, the pageout daemon
can take care of it. If the pageout daemon had
blocked for memory, it was difficult to handle
the situation correctly (and in some cases, was
impossible).

2. The collapse problem has now been entirely fixed.
It now appears to be impossible to accumulate unnecessary
vm objects. The object collapsing now occurs when ref counts
drop to one (where it is more likely to be more simple anyway
because less pages would be out on disk.) The original
fixes were incomplete in that pathological circumstances
could still be contrived to cause uncontrolled growth
of swap. Also, the old code still, under steady state
conditions, used more swap space than necessary. When
using the new code, users will generally notice a
significant decrease in swap space usage, and theoretically,
the system should be leaving fewer unused pages around
competing for memory.

Submitted by: John Dyson


4797 24-Nov-1994 dg

Don't try to page to a vnode that had its filesystem unmounted.


4768 22-Nov-1994 dg

Preallocate the first swap block to work around a failure with swap starting
at physical block 0. Note that this will show up in pstat -s and swapinfo
as space "in use". In reality, the space is simply never made available.


4537 17-Nov-1994 dg

Don't ever try to kill off process 1 - even if we are out of swap space
and it's the candidate pig.


4534 17-Nov-1994 gibbs

Remove a piece of commented out code that was left over from the early
stages of debugging LFS:

* if we can't bmap, use old VOP code
*/
! if (/* (vp->v_mount && vp->v_mount->mnt_stat.f_type == MOUNT_LFS) || */
! VOP_BMAP(vp, foff, &dp, 0, 0)) {
for (i = 0; i < count; i++) {
if (i != reqpage) {
vnode_pager_freepage(m[i]);
--- 804,810 ----
/*
* if we can't bmap, use old VOP code
*/
! if (VOP_BMAP(vp, foff, &dp, 0, 0)) {

Reviewed by: gibbs
Submitted by: John Dyson


4461 14-Nov-1994 bde

pmap.h:
Disable the bogus declaration of pmap_bootstrap(). Since its arg list
is machine-dependent, it must be declared in a machine-dependent header.

vm_page.h:
Change `inline' to `__inline' and old-style function parameter lists for
inlined functions to new-style.

`inline' and old-style function parameter lists should never be used in
system headers, even in very machine-dependent ones, because they cause
warnings from gcc -Wreally-all.
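
As an illustration of the two styles, here is a small sketch. The function is invented, and the old-style form is shown only in a comment because current compilers may reject it.

```c
#include <assert.h>

/*
 * Old style, as discouraged above: plain `inline` plus a K&R
 * parameter list.
 *
 *	static inline int
 *	sketch_pages(bytes, page_size)
 *		int bytes, page_size;
 *	{ ... }
 */

/* New style: `__inline` and a prototype-style parameter list. */
static __inline int
sketch_pages(int bytes, int page_size)
{
	/* Number of pages needed to hold `bytes`, rounding up. */
	return ((bytes + page_size - 1) / page_size);
}
```

The `__inline` spelling is a compiler extension keyword accepted by gcc and clang, so it stays quiet under strict warning flags where plain `inline` did not at the time.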


4447 14-Nov-1994 dg

Set laundry flag when transitioning an inactive page from clean to dirty.
This fixes a performance bug where pages would sometimes not be paged
out when they could be.

Submitted by: John Dyson


4446 13-Nov-1994 dg

Fixed bug where a read-behind to a negative offset would occur if the
fault was at offset 0 in the object. This resulted in more overhead but
was otherwise benign. Added incore() check in vnode_pager_has_page()
to work around a problem with LFS...other than slightly higher overhead,
this change has no effect on UFS.
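
The read-behind fix can be pictured as a clamp on the starting offset. This is a hedged sketch with an assumed 4096-byte page size and an invented function name, not the vnode pager code itself.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

/* First offset to read when reading `behind_pages` before the fault. */
static int64_t
sketch_readbehind_start(int64_t fault_off, int behind_pages)
{
	int64_t start = fault_off - (int64_t)behind_pages * SKETCH_PAGE_SIZE;

	/* Never reach back past the beginning of the object. */
	return (start < 0 ? 0 : start);
}
```

A fault at offset 0 therefore starts the read at 0 instead of at a negative offset.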


4440 13-Nov-1994 dg

Fixed bugs in accounting of swap space that resulted in the pager thinking
it was out of space when it really wasn't.

Submitted by: John Dyson


4439 13-Nov-1994 dg

Implemented swap locking via P_SWAPPING flag. It was possible for a process
to be chosen for swap-in while it was being swapped-out. This was BAD.

Submitted by: John Dyson
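
The race described above can be closed with a flag that marks a swap-out in flight. The sketch below assumes flag values and helper names purely for illustration; only the P_SWAPPING name comes from the log.

```c
#include <assert.h>

#define SKETCH_P_INMEM    0x1	/* assumed: process image is resident */
#define SKETCH_P_SWAPPING 0x2	/* swap-out currently in progress */

struct sketch_proc {
	int p_flag;
};

/* Only a fully swapped-out process is a swap-in candidate. */
static int
sketch_can_swapin(const struct sketch_proc *p)
{
	return ((p->p_flag & (SKETCH_P_INMEM | SKETCH_P_SWAPPING)) == 0);
}

/* Mark the start of a swap-out. */
static void
sketch_swapout_begin(struct sketch_proc *p)
{
	p->p_flag |= SKETCH_P_SWAPPING;
}

/* Swap-out finished: the process is out and the lock is released. */
static void
sketch_swapout_end(struct sketch_proc *p)
{
	p->p_flag &= ~(SKETCH_P_INMEM | SKETCH_P_SWAPPING);
}
```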


4207 06-Nov-1994 dg

Fixed return status from pagers. Ahem...the previous method would manufacture
data when it couldn't get it legitimately. :-(

Submitted by: John Dyson


4203 06-Nov-1994 dg

Added support for starting the experimental "vmdaemon" system process.
Enabled via REL2_1.

Added support for doing object collapses "on the fly". Enabled via REL2_1a.

Improved object collapses so that they can happen in more cases. Improved
sensing of modified pages to fix an apparent race condition and improve
clustered pageout opportunities. Fixed an "oops" with not restarting page
scan after a potential block in vm_pageout_clean() (not doing this can result
in strange behavior in some cases).

Submitted by: John Dyson & David Greenman


3841 25-Oct-1994 dg

Improved I/O error reporting.


3839 25-Oct-1994 dg

#if 0'd out the object cache trimming code - there are multiple ways
that the pageout daemon can deadlock otherwise.

Submitted by: John Dyson


3815 23-Oct-1994 dg

Fixed object cache trimming policy so it actually works.

Submitted by: John Dyson


3814 23-Oct-1994 dg

Adjusted reserved levels to fix a deadlock condition.

Submitted by: John Dyson


3807 23-Oct-1994 dg

Changed a thread_sleep into an spl protected tsleep. A deadlock can occur
otherwise. Minor efficiency improvement in vm_page_free().

Submitted by: John Dyson


3798 22-Oct-1994 phk

Contrary to my last commit here: NFS-swap is enabled automatically.


3772 22-Oct-1994 dg

Fixed a comment from the previous commit.


3766 22-Oct-1994 dg

Various changes to allow operation without any swapspace configured. Note
that this is intended for use only in floppy situations and is done at
the sacrifice of performance in that case (in other words, this is not the
best solution, but works okay for this exceptional situation).

Submitted by: John Dyson


3748 21-Oct-1994 phk

ATTENTION!

From now on, >all< swapdevices must be activated with "swapon".

If you haven't got it, add this line to /etc/fstab:
/dev/wd0b none swap sw 0 0
ne sec

Reason:
We want our GENERIC* kernels to have a large selection of swap-devices, but
on the other hand, we don't want to use a wd0b as swap when we boot off a
floppy. This way, we will never use an unexpected swapdevice. Nothing else
has changed.


3745 21-Oct-1994 wollman

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).

This involves fixing a few things I broke last time.


3692 18-Oct-1994 dg

Fix the remaining vmmeter counters. They all now work correctly.


3660 17-Oct-1994 dg

Put sanity check for negative hold count into #ifdef DIAGNOSTIC so that
it doesn't consume an extra 3k of kernel text because of gcc's bogus
inlining code.
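
The pattern of compiling a check only into diagnostic kernels looks roughly like this. The structure and function are invented, and abort() stands in for a kernel panic.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct sketch_page {
	int hold_count;
};

/* Drop one hold; the sanity check exists only in DIAGNOSTIC builds. */
static void
sketch_unhold(struct sketch_page *m)
{
	m->hold_count--;
#ifdef DIAGNOSTIC
	if (m->hold_count < 0) {
		fprintf(stderr, "sketch_unhold: negative hold count\n");
		abort();	/* a kernel would panic here */
	}
#endif
}
```

When DIAGNOSTIC is not defined the check compiles to nothing, so every inlined copy of the function shrinks accordingly.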


3612 15-Oct-1994 dg

1) Some of the counters in the vmmeter struct don't fit well into the Mach VM
scheme of things, so I've changed them to be more appropriate. page in/outs
are now associated with the pager that did them. Nuked v_fault as the
only fault of interest that wouldn't be already counted in v_trap is a VM
fault, and this is counted separately.
2) Implemented most of the remaining counters and corrected the counting of
some that were done wrong. They are all almost correct now...just a few
minor ones left to fix.


3611 15-Oct-1994 dg

Count vm faults as v_vm_fault, not v_fault.


3610 15-Oct-1994 dg

Properly count object lookups and hits.


3591 14-Oct-1994 dg

Got rid of redundant declaration warnings.


3587 14-Oct-1994 jkh

Add missing )'s to previous midnight changes. :-)


3573 14-Oct-1994 dg

Fixed bug where page modifications would be lost when swap space was
almost depleted.

Reviewed by: John Dyson


3572 14-Oct-1994 dg

Changed I/O error messages to be somewhat less cryptic. Removed a piece
of unused code.


3567 13-Oct-1994 dg

Fixed an object reference count problem that was caused by a call to
vm_object_lookup() being outside of some parens. The bug was introduced
via some recently added code.

Reviewed by: John Dyson


3451 09-Oct-1994 dg

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


3449 09-Oct-1994 phk

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


3446 09-Oct-1994 dg

Call resetpriority, not setpriority() ...oops.

Submitted by: John Dyson


3407 07-Oct-1994 phk

Cosmetics. Unused vars and other warnings.


3374 05-Oct-1994 dg

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


3373 05-Oct-1994 dg

Fixed minor bug caused by some missing parens that can result in slightly
reduced paging performance by missing a clustering opportunity. Found
by Poul-Henning Kamp with gcc -Wall.


3354 04-Oct-1994 dg

John Dyson's work in progress. Not currently used.


3347 04-Oct-1994 dg

Fixed bug related to proper sensing of page modification that we
inadvertently introduced in pre-1.1.5. This could cause page modifications
to go unnoticed during certain extreme low memory/high paging rate conditions.

Submitted by: John Dyson and David Greenman


3311 02-Oct-1994 phk

GCC cleanup.


3154 27-Sep-1994 dg

Previous commit should have read ...in vm_page_alloc_contig().
...(this commit): moved initialization of 'start' to make it more clear
that it is initialized properly (also in vm_page_alloc_contig).


3153 27-Sep-1994 dg

Fixed another bug, and cleaned up the code.


3147 27-Sep-1994 dg

Fixed multiple bugs in previous version of vm_page_alloc_contig.


3145 27-Sep-1994 dg

1) New "vm_page_alloc_contig" routine by me.
2) Created a new vm_page flag "PG_FREE" to help track free pages.
3) Use PG_FREE flag to detect inconsistencies in a few places.


3103 25-Sep-1994 dg

Removed unimplemented subr_rmap.c and unused references to it.


3083 25-Sep-1994 dg

Disabled swap anti-fragmentation code. It reduces swap paging performance
by 20% in my tests, and it appears to be the cause of a swap leak.

Submitted by: John Dyson


2692 12-Sep-1994 dg

Fixed a bug I introduced when fixing the rss limit code. Changed swapout
policy to be a bit more selective about what processes get swapped out.

Reviewed by: John Dyson


2689 12-Sep-1994 dg

Eliminated a whole pile of ancient (we're talking 4.3BSD) VM system
related #define constants. Corrected incorrect VM_MAX_KERNEL_ADDRESS.

Reviewed by: John Dyson


2688 12-Sep-1994 dg

Don't deactivate pages in 0-refcount objects. Added a couple of missing
paging stats. Fixed problem with free_reserved becoming depleted during
certain swap_pager operations.

Submitted by: John Dyson, with a little help from me


2654 11-Sep-1994 dg

Fixed problem with no swap on boot device, but there is some on an
alternate device (as specified via kernel config file)...that causes
the machine to panic.


2524 06-Sep-1994 dg

Disabled a debugging printf.


2521 06-Sep-1994 dg

Simple changes to paging algorithms...but boy do they make a difference.
FreeBSD's paging performance has never been better. Wow.

Submitted by: John Dyson


2462 02-Sep-1994 dg

Whoops, accidentally left out some pieces of the munmapfd patch.


2455 02-Sep-1994 dg

Removed all vestiges of tlbflush(). Replaced them with calls to pmap_update().
Made pmap_update an inline assembly function.


2413 30-Aug-1994 dg

Fixed bug caused by change of rlimit variables to quad_t's. The bug was in
using min() to calculate the minimum of rss_cur,rss_max - since these
are now quad_t's and min() takes u_ints...the comparison later for exceeding
the rss limit was always true - resulting in rather serious page thrashing.
Now using new qmin() function for this purpose.

Fixed another bug where PG_BUSY pages would sometimes be paged out (bad!).
This was caused by the PG_BUSY flag not being included in a comparison.
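
The min() problem is a narrowing conversion. The sketch below uses stand-in functions (the log names the real replacement qmin(); everything here is hypothetical) and example values chosen only to make the truncation visible.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a min() that takes 32-bit unsigned arguments. */
static uint32_t
sketch_umin(uint32_t a, uint32_t b)
{
	return (a < b ? a : b);
}

/* Stand-in for a 64-bit-aware minimum. */
static int64_t
sketch_qmin(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/* Limit computed the broken way: both values are narrowed first. */
static int64_t
sketch_limit_truncating(int64_t rss_cur, int64_t rss_max)
{
	return (sketch_umin((uint32_t)rss_cur, (uint32_t)rss_max));
}
```

With 64-bit limits of 2^32 and 2^33, the narrowing version returns 0 because the upper bits are discarded, while the 64-bit version returns 2^32. A limit that collapses like this makes any "resident size exceeds limit" test succeed.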


2386 29-Aug-1994 dg

Patches from John Dyson to improve swap code efficiency.
Religiously add back pmap_clear_modify() in vnode_pager_input until we figure
out why system performance isn't what we expect.

Submitted by: John Dyson (swap_pager) & David Greenman (vnode_pager)


2320 27-Aug-1994 dg

1) Changed ddb into an option rather than a pseudo-device (use options DDB
in your kernel config now).
2) Added ps ddb function from 1.1.5. Cleaned it up a bit and moved into its
own file.
3) Added \r handling in db_printf.
4) Added missing memory usage stats to statclock().
5) Added dummy function to pseudo_set so it will be emitted if there
are no other pseudo declarations.


2177 21-Aug-1994 paul

Made idempotent


2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


1997 10-Aug-1994 dg

Fixed vm_page_deactivate to deal with getting called with a page that's
not on any queue. This is an old patchkit days fix.

Reviewed by: John Dyson and David Greenman
Submitted by: originally by Paul Mackerras


1974 09-Aug-1994 dg

Removed an old, obsolete call to vmmeter(). This is called now in the
schedcpu() routine in kern/kern_synch.c. This extra call to vmmeter() in
vm_glue.c was what was totally messing up the load average calculations.


1896 07-Aug-1994 dg

Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that
are no longer needed because of this.


1895 07-Aug-1994 dg

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


1890 06-Aug-1994 dg

Fixed various prototype problems with the pmap functions and the subsequent
problems that fixing them caused.


1887 06-Aug-1994 dg

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/
pmap_kremove. These routines allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


1885 06-Aug-1994 dg

Enabled page table preloading of cached objects.

Submitted by: John Dyson


1835 04-Aug-1994 dg

Added some code that was accidentally left out early in the 1.x -> 2.0 VM
system conversion.
Submitted by: John Dyson


1827 04-Aug-1994 dg

Integrated VM system improvements/fixes from FreeBSD-1.1.5.


1817 02-Aug-1994 dg

Added $Id$


1810 01-Aug-1994 dg

Removed all code related to the pagescan daemon, and changed 'act_count'
adjustments to compensate for a world without the pagescan daemon.


1687 06-Jun-1994 dg

Don't move the page's position in the active queue if it is busy or
held. John has noticed some stability problems when doing this.


1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.