History log of /freebsd-11-stable/sys/vm/vm_radix.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 327785 10-Jan-2018 markj

MFC r325530 (jeff), r325566 (kib), r325588 (kib):
Replace many instances of VM_WAIT with blocking page allocation flags.


# 321513 26-Jul-2017 kib

MFC r321247:
Add pctrie_init() and vm_radix_init() to initialize generic pctrie and
vm_radix trie.


# 318716 23-May-2017 markj

MFC r308474, r308691, r309203, r309365, r309703, r309898, r310720,
r308489, r308706:
Add PQ_LAUNDRY and remove PG_CACHED pages.


# 315472 18-Mar-2017 alc

MFC r309416
Eliminate a stale comment; vm_radix_prealloc() was replaced in r254141.


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 298482 22-Apr-2016 pfg

Cleanup redundant parenthesis from existing howmany()/roundup() macro uses.


# 267992 28-Jun-2014 hselasky

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 267985 27-Jun-2014 gjb

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 267961 27-Jun-2014 hselasky

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 263620 22-Mar-2014 bdrewery

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 259107 08-Dec-2013 alc

Eliminate a redundant parameter to vm_radix_replace().

Improve the wording of the comment describing vm_radix_replace().

Reviewed by: attilio
MFC after: 6 weeks
Sponsored by: EMC / Isilon Storage Division


# 254719 23-Aug-2013 alc

Addendum to r254141: The call to vm_radix_insert() in vm_page_cache() can
reclaim the last preexisting cached page in the object, resulting in a call
to vdrop(). Detect this scenario so that the vnode's hold count is
correctly maintained. Otherwise, we panic.

Reported by: scottl
Tested by: pho
Discussed with: attilio, jeff, kib


# 254141 09-Aug-2013 attilio

On all the architectures, avoid to preallocate the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid to pre-allocate
the KVA for such nodes.

In order to do so make the operations derived from vm_radix_insert()
to fail and handle all the deriving failure of those.

vm_radix-wise introduce a new function called vm_radix_replace(),
which can replace a leaf node, already present, with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# 250520 11-May-2013 alc

To reduce the amount of arithmetic performed in the various radix tree
functions, reverse the numbering scheme for the levels. The highest
numbered level in the tree now appears near the root instead of the leaves.

Sponsored by: EMC / Isilon Storage Division


# 250334 07-May-2013 alc

Remove a redundant call to panic() from vm_radix_keydiff(). The assertion
before the loop accomplishes the same thing.

Sponsored by: EMC / Isilon Storage Division


# 250259 04-May-2013 alc

Optimize vm_radix_lookup_ge() and vm_radix_lookup_le(). Specifically,
change the way that these functions ascend the tree when the search for a
matching leaf fails at an interior node. Rather than returning to the root
of the tree and repeating the lookup with an updated key, maintain a stack
of interior nodes that were visited during the descent and use that stack
to resume the lookup at the closest ancestor that might have a matching
descendant.

Sponsored by: EMC / Isilon Storage Division
Reviewed by: attilio
Tested by: pho


# 250018 28-Apr-2013 alc

Eliminate an unneeded call to vm_radix_trimkey() from vm_radix_lookup_le().
This call is clearing bits from the key that will be set again by the next
line.

Sponsored by: EMC / Isilon Storage Division


# 249986 27-Apr-2013 alc

Avoid some lookup restarts in vm_radix_lookup_{ge,le}().

Sponsored by: EMC / Isilon Storage Division


# 249745 21-Apr-2013 alc

Simplify vm_radix_{add,dec}lev().

Sponsored by: EMC / Isilon Storage Division


# 249605 18-Apr-2013 alc

When calculating the number of reserved nodes, discount the pages that will
be used to store the nodes.

Sponsored by: EMC / Isilon Storage Division


# 249502 15-Apr-2013 alc

Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we have previously created a level zero
interior node at the root of every non-empty trie, even when that node is
not strictly necessary, i.e., it has only one child. This change is the
second (and final) step in eliminating those unnecessary level zero interior
nodes. Specifically, it updates the deletion and insertion functions so
that they do not require a level zero interior node at the root of the trie.
For a "buildworld" workload, this change results in a 16.8% reduction in the
number of interior nodes allocated and a similar reduction in the average
execution time for lookup functions. For example, the average execution
time for a call to vm_radix_lookup_ge() is reduced by 22.9%.

Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division


# 249427 12-Apr-2013 alc

Although we perform path compression to reduce the height of the trie and
the number of interior nodes, we always create a level zero interior node at
the root of every non-empty trie, even when that node is not strictly
necessary, i.e., it has only one child. This change is the first step in
eliminating those unnecessary level zero interior nodes. Specifically, it
updates all of the lookup functions so that they do not require a level zero
interior node at the root.

Reviewed by: attilio, jeff (an earlier version)
Sponsored by: EMC / Isilon Storage Division


# 249221 06-Apr-2013 alc

Micro-optimize the order of struct vm_radix_node's fields. Specifically,
arrange for all of the fields to start at a short offset from the
beginning of the structure.

Eliminate unnecessary masking of VM_RADIX_FLAGS from the root pointer in
vm_radix_getroot().

Sponsored by: EMC / Isilon Storage Division


# 249211 06-Apr-2013 alc

Simplify vm_radix_keybarr().

Sponsored by: EMC / Isilon Storage Division


# 249182 06-Apr-2013 alc

Simplify vm_radix_insert().

Reviewed by: attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 249038 03-Apr-2013 alc

Replace the remaining uses of vm_radix_node_page() by vm_radix_isleaf() and
vm_radix_topage(). This transformation eliminates some unnecessary
conditional branches from the inner loops of vm_radix_insert(),
vm_radix_lookup{,_ge,_le}(), and vm_radix_remove().

Simplify the control flow of vm_radix_lookup_{ge,le}().

Reviewed by: attilio (an earlier version)
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 248728 26-Mar-2013 alc

Introduce vm_radix_isleaf() and use it in a couple places. As compared to
using vm_radix_node_page() == NULL, the compiler is able to generate one
less conditional branch when vm_radix_isleaf() is used. More use cases
involving the inner loops of vm_radix_insert(), vm_radix_lookup{,_ge,_le}(),
and vm_radix_remove() will follow.

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# 248684 24-Mar-2013 alc

Micro-optimize the control flow in a few places. Eliminate a panic call
that could never be reached in vm_radix_insert(). (If the pointer being
checked by the panic call were ever NULL, the immmediately preceding loop
would have already crashed on a NULL pointer dereference.)

Reviewed by: attilio (an earlier version)
Sponsored by: EMC / Isilon Storage Division


# 248449 17-Mar-2013 attilio

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, switch also the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allowing the acquisition of read locking for lookup operations of the
resident/cached pages collections as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie does rely on the consumers locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at maximum one new node per insert which means the
maximum number of the desired nodes is the number of available physical
frames themselves. However, not all the times a new bisection node is
really needed.

The radix trie implements path-compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels to usually scan. It also helps in the nodes pre-fetching by
introducing the single node per-insert property.

This code is not generalized (yet) because of the possible loss of
performance by having much of the sizes in play configurable.
However, efforts to make this code more general and then reusable in
further different consumers might be really done.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes broken into mealpieces can be retrieved from the
svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 248448 17-Mar-2013 attilio

Commit new file FreeBSD tags.

Sponsored by: EMC / Isilon storage division


# 248444 17-Mar-2013 alc

Fix a couple typos.

Sponsored by: EMC / Isilon Storage Division


# 248431 17-Mar-2013 alc

The M_ZERO can be eliminated from the uma_zalloc() call in
vm_radix_node_get() with a small change to vm_radix_reclaim_allnodes_int().
This change further reduced the average number of cycles per
vm_page_insert() call from 532 to 519.

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# 248428 17-Mar-2013 alc

Simplify the interface to vm_radix_insert() by eliminating the parameter
"index". The content of a radix tree leaf, or at least its "key", is not
opaque to the other radix tree operations. Specifically, they know how to
extract the "key" from a leaf. So, eliminating the parameter "index" isn't
breaking the abstraction. Moreover, eliminating the parameter "index"
effectively prevents the caller from passing an inconsistent "index" and
leaf to vm_radix_insert().

Reviewed by: attilio
Sponsored by: EMC / Isilon Storage Division


# 248424 17-Mar-2013 attilio

Expand ambiguous comments some more.

Requested by: alc


# 248225 12-Mar-2013 attilio

Fix compilation.

Sponsored by: EMC / Isilon storage division


# 248222 12-Mar-2013 attilio

Add a further safety belt to prevent inconsistencies.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


# 248221 12-Mar-2013 attilio

For uniformity, use the user provided index.

Sponsored by: EMC / Isilon storage division
Reviewed and reported by: alc


# 248110 09-Mar-2013 attilio

Merge back vmc-playground into vmcontention.
vm_radix.{c, h} and _vm_radix.h are copied straight from the branch
to preserve history.


# 247969 07-Mar-2013 attilio

Improve comments.

Sponsored by: EMC / Isilon storage division
Submitted by: mdf


# 247773 04-Mar-2013 alc

Fix a typo.

Sponsored by: EMC / Isilon Storage Division


# 247771 04-Mar-2013 alc

Make a pass over most of the comments.


# 247770 04-Mar-2013 alc

Simplify Boolean expressions.

Sponsored by: EMC / Isilon Storage Division


# 247769 04-Mar-2013 alc

Fix spelling.

Sponsored by: EMC / Isilon Storage Division


# 247742 03-Mar-2013 attilio

Remove the boot-time cache support and rely on UMA boot-time slab cache
for allocating the nodes before to have the possibility to carve
directly from the UMA subsystem.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 247233 24-Feb-2013 attilio

Missing semicolon.

Sponsored by: EMC / Isilon storage division
Submitted by: alc
Pointy hat to: me


# 247232 24-Feb-2013 attilio

Simplify return logic.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


# 247224 24-Feb-2013 attilio

Retire the old UMA primitive uma_zone_set_obj() and replace it with the
more modern uma_zone_reserve_kva(). The difference is that it doesn't
rely anymore on an obj to allocate pages and the slab allocator doesn't
use any more any specific locking but atomic operations to complete
the operation.
Where possible, the uma_small_alloc() is instead used and the uk_kva
member becomes unused.

The subsequent cleanups also brings along the removal of
VM_OBJECT_LOCK_INIT() macro which is not used anymore as the code
can be easilly cleaned up to perform a single mtx_init(), private
to vm_object.c.
For the same reason, _vm_object_allocate() becomes private as well.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 247222 24-Feb-2013 attilio

Fix an inverted check that was reporting indexes wrongly detected
as wrapped.

Sponsored by: EMC / Isilon storage divison
Reported by: alc


# 246840 15-Feb-2013 attilio

On arches with VM_PHYSSEG_DENSE the vm_page_array is larger than
the actual number of vm_page_t that will be derived, so v_page_count
should be used appropriately.

Besides that, add a panic condition in case UMA fails to properly
restrict the area in a way to keep all the desired objects.

Sponsored by: EMC / Isilon storage division
Reported by: alc


# 246839 15-Feb-2013 attilio

Remove unused headers.


# 246837 15-Feb-2013 attilio

Fix comment.


# 246836 15-Feb-2013 attilio

Move the radix node zone destructor definition closer to
vm_radix_init() definition.

Sponsored by: EMC / Isilon storage division


# 246835 15-Feb-2013 attilio

- When panicing for "too small boot cache" reason, print the actual
cache size value
- Add a way to specify the size of the boot cache at compile time

Sponsored by: EMC / Isilon storage division


# 246834 15-Feb-2013 attilio

Improve dynamic branch prediction and i-cache utilization:
- Use predict_false() to tag boot-time cache decisions
- Compact boot-time cache allocation into a separate, non-inline,
function that won't be called most of the times.

Sponsored by: EMC / Isilon storage division


# 246795 14-Feb-2013 attilio

Fix style.


# 246794 14-Feb-2013 attilio

The radix preallocation pages can overfow the biggestone segment, so
use a different scheme for preallocation: reserve few KB of nodes to be
used to cater page allocations before the memory can be efficiently
pre-allocated by UMA.

This at all effects remove boot_pages further carving and along with
this modifies to the boot_pages allocation system and necessity to
initialize the UMA zone before pmap_init().

Reported by: pho, jhb


# 246730 13-Feb-2013 attilio

Grammar.

Sponsored by: EMC / Isilon storage division


# 246726 12-Feb-2013 attilio

Implement a new algorithm for managing the radix trie which also
includes path-compression. This greatly helps with sparsely populated
tries, where an uncompressed trie may end up by having a lot of
intermediate nodes for very little leaves.

The new algorithm introduces 2 main concepts: the node level and the
node owner. Every node represents a branch point where the leaves share
the key up to the level specified in the node-level (current level
excluded, of course). Such key partly shared is the one contained in
the owner. Of course, the root branch is exempted to keep a valid
owner, because theoretically all the keys are contained in the space
designed by the root branch node. The search algorithm seems very
intuitive and that is where one should start reading to understand the
full approach.

In the end, the algorithm ends up by demanding only one node per insert
and this is not necessary in all the cases. To stay safe, we basically
preallocate as many nodes as the number of physical pages are in the
system, using uma_preallocate(). However, this raises 2 concerns:
* As pmap_init() needs to kmem_alloc(), the nodes must be pre-allocated
when vm_radix_init() is currently called, which is much before UMA
is fully initialized. This means that uma_prealloc() will dig into the
UMA_BOOT_PAGES pool of pages, which is often not enough to keep track
of such large allocations.
In order to fix this, change a bit the concept of UMA_BOOT_PAGES and
vm.boot_pages. More specifically make the UMA_BOOT_PAGES an initial "value"
as long as vm.boot_pages and extend the boot_pages physical area by as
many bytes as needed with the information returned by
vm_radix_allocphys_size().
* A small amount of pages will be held in per-cpu buckets and won't be
accessible from curcpu, so the vm_radix_node_get() could really panic
when the pre-allocation pool is close to be exhausted.
In theory we could pre-allocate more pages than the number of physical
frames to satisfy such request, but as many insert would happen without
a node allocation anyway, I think it is safe to assume that the
over-allocation is already compensating for such problem.
On the field testing can stand me correct, of course. This could be
further helped by the case where we allow a single-page insert to not
require a complete root node.

The use of pre-allocation gets rid all the non-direct mapping trickery
and introduced lock recursion allowance for vm_page_free_queue.

The nodes children are reduced in number from 32 -> 16 and from 16 -> 8
(for respectively 64 bits and 32 bits architectures).
This would make the children to fit into cacheline for amd64 case,
for example, and in general spawn less cacheline, which may be
helpful in lookup_ge() case.
Also, path-compression cames to help in cases where there are many levels,
making the fallouts of such change less hurting.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff (partially)
Tested by: flo


# 246430 06-Feb-2013 attilio

Cleanup vm_radix KPI:
- Avoid the return value for vm_radix_insert()
- Name the functions argument per-style(9)
- Avoid to get and return opaque objects but use vm_page_t as vm_radix is
thought to not really be general code but to cater specifically page
cache and resident cache.


# 245254 10-Jan-2013 attilio

Remove vm_radix_lookupn() and its usage in the kernel.


# 238245 08-Jul-2012 attilio

- Split the cached and resident pages tree into 2 distinct ones.
This makes the RED/BLACK support go away and simplifies a lot vmradix
functions used here. This happens because with patricia trie support
the trie will be little enough that keeping 2 diffetnt will be
efficient too.
- Reduce differences with head, in places like backing scan where the
optimizazions used shuffled the code a little bit around.

Tested by: flo, Andrea Barberio


# 236763 08-Jun-2012 attilio

Revert r231027 and fix the prototype for vm_radix_remove().
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.


# 236760 08-Jun-2012 attilio

Revert r236367.
The target of this is getting at the point where the recovery path is
completely removed as we could count on pre-allocation once the
path compressed trie is implemented.


# 236728 07-Jun-2012 attilio

Create a sub-branch for saving a temporary working version of vmcontention
and keep doing experiments.


# 236367 31-May-2012 attilio

Simplify insert path by using the same logic of vm_radix_remove() for
the recovery path. The bulk of vm_radix_remove() is put into a generic
function vm_radix_sweep() which allows 2 different modes (hard and soft):
the soft one will deal with half-constructed paths by cleaning them up.

Ideally all these complications should go once that a way to pre-allocate
is implemented, possibly by implementing path compression.

Requested and discussed with: jeff
Tested by: pho


# 235354 12-May-2012 attilio

Add braces.


# 235352 12-May-2012 attilio

On 32-bits architecture KTR has a bug as it cannot correctly grok
64-bits numbers. ktr_tracepoint() infacts casts all the passed value to
u_long values as that is what the ktr entries can handle.

However, we have to work a lot with vm_pindex_t which are always 64-bit
also on 32-bits architectures (most notable case being i386).

Use macros to split the 64 bits printing into 32-bits chunks which
KTR can correctly handle.

Reported and tested by: flo


# 235349 12-May-2012 attilio

- Fix a bug where lookupn can wrap up looking for the pages to scan,
returning a non correct very low address again.
- Stub out vm_lookup_foreach as it is not used


# 233034 16-Mar-2012 attilio

Fix the nodes allocator in architectures without direct-mapping:
- Fix bugs in the free path where the pages were not unwired and
relevant locking wasn't acquired.
- Introduce the rnode_map, submap of kernel_map, where to allocate from.
The reason is that, in architectures without direct-mapping,
kmem_alloc*() will try to insert the newly created mapping while
holding the vm_object lock introducing a LOR or lock recursion.
rnode_map is however a leafly-used submap, thus there cannot be any
deadlock.
Notes: Size the submap in order to be, by default, around 64 MB and
decrase the size of the nodes as the allocation will be much smaller
(and when the compacting code in the vm_radix will be implemented this
will aim for much less space to be used). However note that the
size of the submap can be changed at boot time via the
hw.rnode_map_scale scaling factor.
- Use uma_zone_set_max() covering the size of the submap.

Tested by: flo


# 232631 06-Mar-2012 attilio

Fix a compile time bug by adding a check just after the struct
definition


# 232326 29-Feb-2012 attilio

- Exclude vm_radix_shrink() from the interface but retain the code
still as it can be useful.
- Make most of the interface private as it is unnecessary public right
now. This will help in making nodes changing with arch and still avoid
namespace pollution.


# 231031 05-Feb-2012 flo

fix KTR consistency

I'm committing this on behalf of Attilio as he cannot access svn right now.


# 231027 05-Feb-2012 attilio

Remove the panic from vm_radix_insert() and propagate the error to the
callers of vm_page_insert().

The default action for every caller is to unwind-back the operation
besides vm_page_rename() where this has proven to be impossible to do.
For that case, it just spins until the page is not available to be
allocated. However, due to vm_page_rename() to be mostly rare (and
having never hit this panic in the past) it is tought to be a very
seldom thing and not a possible performance factor.

The patch has been tested with an atomic counter returning NULL from
the zone allocator every 1/100000 allocations. Per-printf, I've verified
that a typical buildkernel could trigger this 30 times. The patch
survived to 2 hours of repeated buildkernel/world.

Several technical notes:
- The vm_page_insert() is moved, in several callers, closer to failure
points. This could be committed separately before vmcontention hits
the tree just to verify -CURRENT is happy with it.
- vm_page_rename() does not need to have the page lock in the callers
as it hide that as an implementation detail. Do the locking internally.
- now vm_page_insert() returns an int, with 0 meaning everything was ok,
thus KPI is broken by this patch.


# 230750 29-Jan-2012 attilio

Fix a bug in vm_radix_leaf() where the shifting start address can
wrap-up at some point.
This bug is triggered very easilly by indirect blocks in UFS which grow
negative resulting in very high counts.

In collabouration with: flo


# 230749 29-Jan-2012 attilio

Fix format string for the pindex members as they should be treated
as uintmax_t for compatibility among 32/64 bits.


# 228314 06-Dec-2011 attilio

Use atomics for rn_count on leaf node because RED operations happen
without the VM_OBJECT_LOCK held, thus can be concurrent with BLACK ones.
However, also use a write memory barrier in order to not reorder the
operation of decrementing rn_count in respect fetching the pointer.

Discussed with: jeff


# 228312 06-Dec-2011 attilio

- Make rn_count 32-bits as it will naturally pad for 32-bit arches
- Avoid to use atomic to manipulate it at level0 because it seems
unneeded and introduces a bug on big-endian architectures where only
the top half (2 bits) of the double-words are written (as sparc64,
for example, doesn't support atomics at 16-bits) heading to a wrong
handling of rn_count.

Reported by: flo, andreast
Found by: marius
No answer by: jeff


# 228282 05-Dec-2011 andreast

Fix compilation issue on 32-bit targets.

Reviewed by: attilio


# 228216 02-Dec-2011 attilio

Revert a change that sneaked in during the last MFC.


# 228210 02-Dec-2011 attilio

MFC


# 228111 29-Nov-2011 attilio

- Remove unnecessary checks on rnode in KTR prints
- Track rn_count in KTR prints
- Improve KTR in a way it best fits rn_count tracking


# 228087 28-Nov-2011 attilio

Fix compile.

Submitted by: flo


# 228079 28-Nov-2011 attilio

Improve the diagnostic in the remove case.


# 227998 26-Nov-2011 attilio

Fix a bug when the 'rnode' pointer can be NULL and we try to track
the children. This helps in debugging case.

Reported by: flo


# 227754 20-Nov-2011 attilio

Add more KTR points for failure in vm_radix_insert().


# 226980 01-Nov-2011 attilio

vm_object_terminate() doesn't actually free the pages in the splay
tree.
Reclaim all the nodes related to the radix tree for a specified
vm_object when calling vm_object_terminate() via the newly added
interface vm_radix_reclaim_nodes().
The function is recursive, but we have a well-defined maximum depth,
thus the amount of necessary stack can be easilly calculated.

Reported by: alc
Discussed and reviewed by: jeff


# 226952 30-Oct-2011 jeff

- Extract part of vm_radix_lookupn() into a function that just finds the
first leaf page in a specified range. This permits us to make many
search & operate functions without much code duplication.
- Make a generic iterator for radix items.


# 226930 30-Oct-2011 jeff

- Support two types of nodes, red and black, within the same radix tree.
Black nodes support standard active pages and red nodes support cached
pages. Red nodes may be removed without the object lock but will not
collapse unused tree nodes. Red nodes may not be directly inserted,
instead a new function is supplied to convert between black and red.
- Handle cached pages and active pages in the same loop in vm_object_split,
vm_object_backing_scan, and vm_object_terminate.
- Retire the splay page handling as the ifdefs are too difficult to
maintain.
- Slightly optimize the vm_radix_lookupn() function.


# 226876 28-Oct-2011 jeff

- Use a single uintptr_t for the root of the radix node that encodes the
height and a pointer so that the update to the root is atomic. This
permits safe lookups in parallel with tree expansion. Shrinking the
space requirements is a small bonus.


# 226873 28-Oct-2011 attilio

Use an UMA zone for the radix node. This avoids the problem to check
for the kernel_map/kmem_map recursion because it uses direct mapping
provided by amd64 to avoid object and map search and recursion.

Probabilly all the others architectures using UMA_MD_SMALL_ALLOC are also
fixed by this, but other remains, where the most notable case is i386.
For it a solution has still to be determined. A way to do this would
be to have a reserved map just for radix node and mark all accesses to
its lock to be witness safe, but that would still be unoptimal due to
the large amount of virtual address space needed to cater the whole
tree.


# 226646 22-Oct-2011 jeff

- Implement vm_radix_lookup_le().
- Fix vm_radix_lookupn() when max == -1 by making the end parameter
inclusive.


# 226645 22-Oct-2011 attilio

Check in an intial implementation of radix tree implementation to replace
the vm object pages splay.

TODO:
- Handle differently the negative keys for having smaller depth
index nodes (negative keys caming from indirect blocks)
- Fix the get_node() by having support for a low reserved objects
directly from UMA
- Implement the lookup_le and re-enable VM_NRESERVELEVEL = 1
- Try to rework the superpage splay of idle pages and the cache splay
for every vm object in order to regain space on vm_page structure
- Verify performance and improve them (likely by having consumers to deal
with several ranges of pages manually?)

Obtained from: jeff, Mayur Shardul (GSoC 2009)