History log of /freebsd-current/sys/kern/kern_mbuf.c
Revision Date Author Comments
# 51346bd5 02-May-2024 John Baldwin <jhb@FreeBSD.org>

mbuf: Add EXT_CTL for mbufs backed by a CTL backend buffer

This is somewhat similar to EXT_NET_DRV, but CTL isn't a network
driver.

Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44725


# 71f8702f 08-Apr-2024 Gleb Smirnoff <glebius@FreeBSD.org>

mbuf: provide mc_get() that allocates struct mchain of given length

Implement m_getm2(), which is widely used via m_getm() macro, as a wrapper
around mc_get(). New code is advised to use mc_get().

Reviewed by: markj, tuexen
Differential Revision: https://reviews.freebsd.org/D44149


# a03c2393 09-Nov-2023 Alexander Motin <mav@FreeBSD.org>

uma: Improve memory modified after free panic messages

- Pass zone pointer to trash_ctor() and report zone name in the panic
message. It may be difficult to figyre out zone just by the item size.
- Do not pass user arguments to internal trash calls, pass thezone.
- Report malloc type name in the same unified panic message.
- Report corruption offset from the beginning of the items instead of
the full pointer. It makes panic message shorter and more readable.


# 6a88498e 09-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

mbuf: Add sysctl flag CTLFLAG_TUN to loader tunables

The following sysctl variables are actually loader tunables. Add sysctl
flag CTLFLAG_TUN to them so that `sysctl -T` will report them correctly.

1. kern.ipc.mb_use_ext_pgs
2. kern.ipc.nmbclusters
3. kern.ipc.nmbjumbop
4. kern.ipc.nmbjumbo9
5. kern.ipc.nmbjumbo16
6. kern.ipc.nmbufs

No functional change intended.

Reviewed by: kib, imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42113


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 64727619 01-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: use IfAPI in mbuf

Sponsored by: Juniper Networks, Inc.


# 1e6131ba 01-Feb-2023 Justin Hibbits <jhibbits@FreeBSD.org>

IfAPI: Add needed APIs for mbuf support

Summary:
Add 2 new APIs for supporting recent mbuf changes:
* 36e0a362ac added the m_snd_tag_alloc() wrapper around
if_snd_tag_alloc(). Push this down to the ifnet level.
* 4d7a1361ef adds the m_rcvif_serialize()/m_rcvif_restore() KPIs to
serialize and restore an ifnet pointer. Add the necessary wrapper to
get the index generation for this.

Reviewed By: jhb
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38340


# 440217b0 21-Sep-2022 Zhenlei Huang <zlei.huang@gmail.com>

debugnet: Fix parameter order in the calls to m_get()

Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D36643


# 840327e5 24-Aug-2022 Brooks Davis <brooks@FreeBSD.org>

mbuf: Don't support PAGE_SIZE < 4K

The Vax supported such things, but FreeBSD does not. This further
implies that MJUMPAGESIZE > MCLBYTES so assert this and remove code
handling them being equal.

Reviewed by: kp, imp, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D36320


# 81a34d37 17-Aug-2022 Gleb Smirnoff <glebius@FreeBSD.org>

protosw: retire pr_drain and use EVENTHANDLER(9) directly

The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.

Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164


# 4d88d81c 25-May-2022 Hans Petter Selasky <hselasky@FreeBSD.org>

mbuf(9): Implement a leaf network interface field in the mbuf packet header.

When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.

This field works similar to the existing "rcvif" mbuf packet header field.

Submitted by: jhb@
Reviewed by: gallatin@ and gnn@
Differential revision: https://reviews.freebsd.org/D35339
Sponsored by: NVIDIA Networking
Sponsored by: Netflix


# 613acc64 27-Jan-2022 Kristof Provost <kp@FreeBSD.org>

mbuf: do not restore dying interfaces

When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34076

(cherry picked from commit 703e533da5e2e4743d38bbf4605fec041bc69976)


# 4d7a1361 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif

Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef09187a

(cherry picked from commit e1882428dcbbafd2814d7e17b977a8f686784b39)


# 6c741ffb 03-May-2022 Marko Zec <zec@FreeBSD.org>

Revert "mbuf: do not restore dying interfaces"

This reverts commit 703e533da5e2e4743d38bbf4605fec041bc69976.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dcbbafd2814d7e17b977a8f686784b39.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex


# 703e533d 27-Jan-2022 Kristof Provost <kp@FreeBSD.org>

mbuf: do not restore dying interfaces

When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34076


# e1882428 26-Jan-2022 Gleb Smirnoff <glebius@FreeBSD.org>

ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif

Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef09187a


# c6c52d8e 26-Dec-2021 Alexander Motin <mav@FreeBSD.org>

kern: Remove CTLFLAG_NEEDGIANT from some more sysctls.

MFC after: 2 weeks


# 0a048d4a 09-Dec-2021 Mateusz Guzik <mjg@FreeBSD.org>

mbuf: plug set-but-not-used vars

Sponsored by: Rubicon Communications, LLC ("Netgate")


# b6cbbcae 16-Nov-2021 Kristof Provost <kp@FreeBSD.org>

m_get3(): actually use the selected zone

Reported by: markj


# 32854e52 16-Nov-2021 Mark Johnston <markj@FreeBSD.org>

mbuf: Properly set the default value for mb_use_ext_pgs

Reported by: Jenkins
Fixes: fcaa890c4469 ("mbuf: Only allow extpg mbufs if the system has a direct map")
Pointy hat: markj


# fcaa890c 16-Nov-2021 Mark Johnston <markj@FreeBSD.org>

mbuf: Only allow extpg mbufs if the system has a direct map

Some upcoming changes will modify software checksum routines like
in_cksum() to operate using m_apply(), which uses the direct map to
access packet data for unmapped mbufs. This approach of course does not
work on platforms without a direct map, so we have to disallow the use
of unmapped mbufs on such platforms.

I believe this is the right tradeoff: we only configure KTLS on amd64
and arm64 today (and one KTLS consumer, NFS TLS, requires a direct map
already), and the use of unmapped mbufs with plain sendfile is a recent
optimization. If need be, m_apply() could be modified to create
CPU-private mappings of extpg mbuf pages as a fallback.

So, change mb_use_ext_pgs to be hard-wired to zero on systems without a
direct map. Note that PMAP_HAS_DMAP is not a compile-time constant on
some systems, so the default value of mb_use_ext_pgs has to be
determined during boot.

Reviewed by: jhb
Discussed with: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32940


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# c782ea8b 14-Sep-2021 John Baldwin <jhb@FreeBSD.org>

Add a switch structure for send tags.

Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572


# a051ca72 07-Aug-2021 Kristof Provost <kp@FreeBSD.org>

Introduce m_get3()

Introduce m_get3() which is similar to m_get2(), but can allocate up to
MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE).

This simplifies the bpf improvement in f13da24715.

Suggested by: glebius
Differential Revision: https://reviews.freebsd.org/D31455


# 10094910 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

uma: Add KMSAN hooks

For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.

Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe

Sponsored by: The FreeBSD Foundation


# 0a718a6e 06-Jul-2021 Mateusz Guzik <mjg@FreeBSD.org>

mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw

Reviewed by: donner
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31082


# 05462bab 30-Jun-2021 Mateusz Guzik <mjg@FreeBSD.org>

mbuf: add m_free_raw to be used instead of directly calling uma_zfree

The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.

Reviewed by: kbowling
Discussed with: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30959


# fb32c8db 30-Jun-2021 Mateusz Guzik <mjg@FreeBSD.org>

iflib: retire MB_DTOR_SKIP

The flag was added in 2016 but remains unused.

Reviewed by: kbowling
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30958


# 52cd25eb 08-Jan-2021 Andrew Gallatin <gallatin@FreeBSD.org>

mbuf: enable ext_pgs ("unmapped") mbufs by default

Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile. Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.

Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.

Reviewed-by: jhb, glebius
Sponsored by: Netfix


# 36e0a362 29-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().

This gives a more uniform API for send tag life cycle management.

Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27000


# 56fb710f 06-Oct-2020 John Baldwin <jhb@FreeBSD.org>

Store the send tag type in the common send tag header.

Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).

In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.

Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 796d4eb8 31-Aug-2020 Andrew Gallatin <gallatin@FreeBSD.org>

make m_getm2() resilient to zone_jumbop exhaustion

When the zone_jumbop is exhausted, most things using
using sosend* (like sshd) will eventually
fail or hang if allocations are limited to the
depleted jumbop zone. This makes it imossible to
communicate with a box which is under an attach which
exhausts the jumbop zone.

Rather than depending on the page size zone, also try cluster
allocations to satisfy larger requests. This allows me
to ssh to, and serve 100Gb/s of traffic from a server which
under attack and has had its page-sized zone exhausted.

Reviewed by: glebius, markj, rmacklem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26150


# edde7a53 05-Aug-2020 Andrey V. Elsukov <ae@FreeBSD.org>

Add m__getjcl SDT probe.

Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC


# 03270b59 20-Jun-2020 Jeff Roberson <jeff@FreeBSD.org>

Use zone nomenclature that is consistent with UMA.


# 84d746de 09-Jun-2020 Rick Macklem <rmacklem@FreeBSD.org>

Add two functions that create M_EXTPG mbufs with anonymous pages.

These two functions are needed by nfs-over-tls, but could also be
useful for other purposes.
mb_alloc_ext_plus_pages() - Allocates a M_EXTPG mbuf and enough anonymous
pages to store "len" data bytes.
mb_mapped_to_unmapped() - Copies the data from a list of mapped (non-M_EXTPG)
mbufs into a list of M_EXTPG mbufs allocated with anonymous pages.
This is roughly the inverse of mb_unmapped_to_ext().

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D25182


# 61664ee7 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.2: start divorce of M_EXT and M_EXTPG

They have more differencies than similarities. For now there is lots
of code that would check for M_EXT only and work correctly on M_EXTPG
buffers, so still carry M_EXT bit together with M_EXTPG. However,
prepare some code for explicit check for M_EXTPG.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 365e8da4 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Mechanically rename MBUF_EXT_PGS_ASSERT() to M_ASSERTEXTPG() to match
classical M_ASSERTPKTHDR.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 6edfd179 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 4.1: mechanically rename M_NOMAP to M_EXTPG

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 7b6c99d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf
within m_epg namespace.
All edits except the 'struct mbuf' declaration and mb_dupcl() were done
mechanically with sed:

s/->m_ext_pgs.nrdy/->m_epg_nrdy/g
s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g
s/->m_ext_pgs.trail_len/->m_epg_trllen/g
s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g
s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g
s/->m_ext_pgs.flags/->m_epg_flags/g
s/->m_ext_pgs.record_type/->m_epg_record_type/g
s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g
s/->m_ext_pgs.tls/->m_epg_tls/g
s/->m_ext_pgs.so/->m_epg_so/g
s/->m_ext_pgs.seqno/->m_epg_seqno/g
s/->m_ext_pgs.stailq/->m_epg_stailq/g

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# bccf6e26 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# b363a438 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Make MBUF_EXT_PGS_ASSERT_SANITY() a macro, so that it prints file:line.
While here, stop using struct mbuf_ext_pgs.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# c4ee38f8 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.3: Rename mbuf_ext_pg_len() to m_epg_pagelen() that
uses mbuf argument.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# d90fe9d0 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Step 2.1: Build TLS workqueue from mbufs, not struct mbuf_ext_pgs.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 4c9f0f98 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

In mb_unmapped_compress() we don't need mbuf structure to keep data,
but we need buffer of MLEN bytes. This isn't just a simplification,
but important fixup, because previous commit shrinked sizeof(struct
mbuf) down below MSIZE, and instantiating an mbuf on stack no longer
provides enough data.

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 0c103266 02-May-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Continuation of multi page mbuf redesign from r359919.

The following series of patches addresses three things:

Now that array of pages is embedded into mbuf, we no longer need
separate structure to pass around, so struct mbuf_ext_pgs is an
artifact of the first implementation. And struct mbuf_ext_pgs_data
is a crutch to accomodate the main idea r359919 with minimal churn.

Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP.

The namespace for the newfeature is somewhat inconsistent and
sometimes has a lengthy prefixes. In these patches we will
gradually bring the namespace to "m_epg" prefix for all mbuf
fields and most functions.

Step 1 of 4:

o Anonymize mbuf_ext_pgs_data, embed in m_ext
o Embed mbuf_ext_pgs
o Start documenting all this entanglement

Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D24598


# 23feb563 14-Apr-2020 Andrew Gallatin <gallatin@FreeBSD.org>

KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.

While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by: hselasky, jhb, rrs, glebius (early version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24213


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 10c8fb47 04-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

uma: convert mbuf_jumbo_alloc to UMA_ZONE_CONTIG & tag others

Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.

Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23238


# 0017b2ad 03-Feb-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Couple protocol drain routines (frag6_drain and sctp_drain) may send
packets. An unexpected behaviour for memory reclamation routine.
Anyway, we need enter the network epoch for doing that.


# 727c6918 03-Jan-2020 Jeff Roberson <jeff@FreeBSD.org>

Use a separate lock for the zone and keg. This provides concurrency
between populating buckets from the slab layer and fetching full buckets
from the zone layer. Eliminate some nonsense locking patterns where
we lock to fetch a single variable.

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D22828


# 30be9685 04-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

mbuf zones: take out the trash

The mbuf zones were explicitly specifying the uma trash procedures on
zcreate, conditionally on INVARIANTS, because that used to be necessary
in order to get use-after-free checking for uma zones with non-empty
constructors or destructors. After r355137 uma automatically invokes
the trash constructor and destructor as long as no init and fini are
specified. This now allows the mbuf zones to pass their constructors
and destructors without needing to add on the uma trash procedures
conditionally.

Reviewed by: cem, jhb, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22583


# 7790c8c1 17-Oct-2019 Conrad Meyer <cem@FreeBSD.org>

Split out a more generic debugnet(4) from netdump(4)

Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport. It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4). Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c. The
separation is not perfect and future improvement is welcome. Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking. Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time. If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by: markj (earlier version)
Some discussion with: emaste, jhb
Objection from: marius
Differential Revision: https://reviews.freebsd.org/D21421


# b2dba663 27-Sep-2019 Andrew Gallatin <gallatin@FreeBSD.org>

kTLS: Fix a bug where we would not encrypt anon data inplace.

Software Kernel TLS needs to allocate a new destination crypto
buffer when encrypting data from the page cache, so as to avoid
overwriting shared clear-text file data with encrypted data
specific to a single socket. When the data is anonymous, eg, not
tied to a file, then we can encrypt in place and avoid allocating
a new page. This fixes a bug where the existing code always
assumes the data is private, and never encrypts in place. This
results in unneeded page allocations and potentially more memory
bandwidth consumption when doing socket writes.

When the code was written at Netflix, ktls_encrypt() looked at
private sendfile flags to determine if the pages being encrypted
where part of the page cache (coming from sendfile) or
anonymous (coming from sosend). This was broken internally at
Netflix when the sendfile flags were made private, and the
M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will
always be false for M_NOMAP mbufs, since one cannot just mtod()
them.

This change introduces a new flags field to the mbuf_ext_pgs
struct by stealing a byte from the tls hdr. Note that the current
header is still 2 bytes larger than the largest header we
support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON
when creating an unmapped mbuf in m_uiotombuf_nomap() (which is
the path that socket writes take), and we check for that flag in
ktls_encrypt() when looking for anon pages.

Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21796


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667


# b2e60773 26-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Add kernel-side support for in-kernel TLS.

KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277


# afa60c06 02-Jul-2019 John Baldwin <jhb@FreeBSD.org>

Invoke ext_free function when freeing an unmapped mbuf.

Fix a mis-merge when extracting the unmapped mbuf changes from
Netflix's in-kernel TLS changes where the call to the function that
freed the backing pages from an unmapped mbuf was missed.

Sponsored by: Chelsio Communications


# cec06a3e 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add support for using unmapped mbufs with sendfile(2).

This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl.
It is disabled by default.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# 82334850 28-Jun-2019 John Baldwin <jhb@FreeBSD.org>

Add an external mbuf buffer type that holds multiple unmapped pages.

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616


# fb3bc596 24-May-2019 John Baldwin <jhb@FreeBSD.org>

Restructure mbuf send tags to provide stronger guarantees.

- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.

Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.

To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.

For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.

- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.

This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).

In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.

To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.

- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117


# eec189c7 31-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually do
anything except several assertions. This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed. Such mbuf can not be stored or reallocated, its
life time is current context.


# 58c43838 27-Jan-2019 Andriy Voskoboinyk <avos@FreeBSD.org>

m_getm2: correct a comment.

The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.

Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.

PR: 134335
MFC after: 12 days


# 0d1467b1 11-Nov-2018 Conrad Meyer <cem@FreeBSD.org>

netdump: Fix netdumping with INVARIANTS kernels

Correct boneheaded assertion I added in r339501. Mea culpa.

The intent is to notice when an M_WAITOK zone allocation would fail during
netdump, not to prevent all use of mbufs during netdump.

Reviewed by: markj
X-MFC-With: r339501
Differential Revision: https://reviews.freebsd.org/D17957


# 9978bd99 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add malloc_domainset(9) and _domainset variants to other allocator KPIs.

Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418


# 3937ee75 20-Oct-2018 Conrad Meyer <cem@FreeBSD.org>

netdump: Zone mbufs should be allocated before dump

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17306


# 5475ca5a 05-May-2018 Mark Johnston <markj@FreeBSD.org>

Add an mbuf allocator for netdump.

The aim is to permit mbuf allocations after a panic without calling into
the page allocator, without imposing any runtime overhead during regular
operation of the system, and without modifying driver code. The approach
taken is to preallocate a number of mbufs and clusters, storing them
in linked lists, and using the lists to back some UMA cache zones. At
panic time, the mbuf and cluster zone pointers are overwritten with
those of the cache zones so that the mbuf allocator returns
preallocated items.

Using this scheme, drivers which cache mbuf zone pointers from
m_getzone() require special handling when implementing netdump support.

Reviewed by: cem (earlier version), julian, sbruno
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15251


# c2ba2d1b 05-May-2018 Mark Johnston <markj@FreeBSD.org>

Style.

MFC after: 3 days


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 9c82bec4 09-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Improvements to sendfile(2) mbuf free routine.

o Fall back to default m_ext free mech, using function pointer in
m_ext_free, and remove sf_ext_free() called directly from mbuf code.
Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
SF_SYNC flag. Lack of the flag allows us not to dereference
ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
multi-page mbufs. For now compiler will inline it into
sendfile_free_mext().

In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615


# 07e87a1d 09-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext().

However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.

Provide longer comments in m_ext declaration to explain this change
and change from r296242.

In collaboration with: gallatin
Differential Revision: https://reviews.freebsd.org/D12615


# e8fd18f3 09-Oct-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Shorten list of arguments to mbuf external storage freeing function.

All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list. Not all functions need the second
argument, some don't even need the first one. The second argument
lives in next cache line, so not dereferencing it is a performance
gain. This was discovered in sendfile(2), which will be covered by
next commits.

The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.

Reviewed by: gallatin, kbowling
Differential Revision: https://reviews.freebsd.org/D12615


# 4c7070db 17-May-2016 Scott Long <scottl@FreeBSD.org>

Import the 'iflib' API library for network drivers. From the author:

"iflib is a library to eliminate the need for frequently duplicated device
independent logic propagated (poorly) across many network drivers."

Participation is purely optional. The IFLIB kernel config option is
provided for drivers that want to transition between legacy and iflib
modes of operation. ixl and ixgbe driver conversions will be committed
shortly. We hope to see participation from the Broadcom and maybe
Chelsio drivers in the near future.

Submitted by: mmacy@nextbsd.org
Reviewed by: gallatin
Differential Revision: D5211


# 17cd649f 17-May-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Add a comment and KASSERT that a M_NOFREE mbuf has always EXT_EXTREF ext.

Submitted by: kmacy


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# fddd4f62 26-Mar-2016 Navdeep Parhar <np@FreeBSD.org>

Plug leak in m_unshare.

m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE. The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by: gnn@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5698


# c8f59118 24-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Space and style(9) corrections for recent mbuf changes.


# 480f4e94 22-Mar-2016 George V. Neville-Neil <gnn@FreeBSD.org>

Add an mbuf provider to DTrace.

The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed. This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by: bz, markj
MFC after: 2 weeks
Sponsored by: Rubicon Communications (Netgate)
Differential Revision: https://reviews.freebsd.org/D5682


# 0ea37a86 01-Mar-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested: jhibbits


# 56a5f52e 29-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

New way to manage reference counting of mbuf external storage.

The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with: rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D5396
Sponsored by: Netflix


# 5e4bc63b 11-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c. This looks like
the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with: melifaro, olivier


# b4b12e52 10-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Garbage collect unused arguments of m_init().


# e60b2fcb 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with: jtl


# 9542ea7b 03-Feb-2016 Gleb Smirnoff <glebius@FreeBSD.org>

Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.


# 54503a13 19-Dec-2015 Jonathan T. Looney <jtl@FreeBSD.org>

Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks


# f2c2231e 31-Mar-2015 Ryan Stone <rstone@FreeBSD.org>

Fix integer truncation bug in malloc(9)

A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int. This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc. zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision: https://reviews.freebsd.org/D2106
Reviewed by: kib
Reported by: Michael Fuckner <michael@fuckner.net>
Tested by: Michael Fuckner <michael@fuckner.net>
MFC after: 2 weeks


# a9fa76f2 30-Sep-2014 Navdeep Parhar <np@FreeBSD.org>

Test for absence of M_NOFREE before attempting to purge the mbuf's tags.
This will leave more state intact should the assertion go off.

MFC after: 1 month


# 1fbe6a82 11-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Improve reference counting of EXT_SFBUF pages attached to mbufs.

o Do not use UMA refcount zone. The problem with this zone is that
several refcounting words (16 on amd64) share the same cache line,
and issueing atomic(9) updates on them creates cache line contention.
Also, allocating and freeing them is extra CPU cycles.
Instead, refcount the page directly via vm_page_wire() and the sfbuf
via sf_buf_alloc(sf_buf_page(sf)) [1].

o Call refcounting/freeing function for EXT_SFBUF via direct function
call, instead of function pointer. This removes barrier for CPU
branch predictor.

o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
copy m_ext in mb_dupcl().

[1] This has some problems for now. Using sf_buf_alloc() merely to
increase refcount is expensive, and is broken on sparc64. To be
fixed.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# fcc34a23 11-Jul-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix style bug: rename the refcount field of m_ext to ext_cnt, to match
other members.

Sponsored by: Nginx, Inc.


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# b7bd677f 04-Apr-2014 Ed Maste <emaste@FreeBSD.org>

Initialise m_pkthdr via bzero instead of explicitly zeroing each member

Sponsored by: The FreeBSD Foundation


# d251e700 10-Oct-2013 John Baldwin <jhb@FreeBSD.org>

Ignore attempts to set the nmbcluster sysctls to their current value
rather than failing with an error.

Reviewed by: andre
Approved by: re (delphij)
MFC after: 2 weeks


# b6f49c23 05-Sep-2013 Hiren Panchasara <hiren@FreeBSD.org>

Fixing a small typo.

Reviewed by: gjb
Approved by: sbruno (mentor)


# ce28636b 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

After r254779 "error" must always be present in mb_ctor_pack(),
not only when MAC is defined.

Reported by: gjb / tinderbox
Sponsored by: The FreeBSD Foundation


# ce6169e7 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Remove unused m_free_fast(). The difference to m_free() is only
2 predictable branches nowadays. However as a pre-condition the
caller had to ensure that the mbuf pkthdr did not have any mtags
attached to it, costing some potential branches again.

Sponsored by: The FreeBSD Foundation


# 1b4381af 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features. The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by: The FreeBSD Foundation


# 894734cb 24-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

dd a 24 bits wide ext_flags field to m_ext by reducing ext_type
to 8 bits. ext_type is an enumerator and the number of types we
have is a mere dozen.

A couple of ext_types are renumbered to fit within 8 bits.

EXT_VENDOR[1-4] and EXT_EXP[1-4] types for vendor-internal and
experimental local mapping.

The ext_flags field is currently unused but has a couple of flags
already defined for future use. Again vendor and experimental
flags are provided for local mapping.

EXT_FLAG_BITS is provided for the printf(9) %b identifier.

Initialize and copy ext_flags in the relevant mbuf functions.

Improve alignment and packing of struct m_ext on 32 and 64 archs
by carefully sorting the fields.


# afb295cc 23-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Avoid code duplication for mbuf initialization and use m_init() instead
in mb_ctor_mbuf() and mb_ctor_pack().


# 1227f20d 21-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Revert r254520 and resurrect the M_NOFREE mbuf flag and functionality.

Requested by: np, grehan


# aa3cb8fb 19-Aug-2013 Andre Oppermann <andre@FreeBSD.org>

Remove the unused M_NOFREE mbuf flag. It didn't have any in-tree users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING external mbuf
type together with a dedicated *ext_free function.

Discussed with: trociny, glebius


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 59b9c4f2 14-Jul-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5.

Now that r253351 moved sendfile() stats to a separate struct, the
last field used in mbstat is m_mcfail, which is updated, but never
read or obtained from userland.


# 05d1f5bc 15-Jul-2013 Andrey V. Elsukov <ae@FreeBSD.org>

Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with: glebius


# bc4a1b8c 10-Jul-2013 Andre Oppermann <andre@FreeBSD.org>

Make use of the fact that uma_zone_set_max(9) already returns the
rounded limit making a call to uma_zone_get_max(9) unnecessary.

MFC after: 1 day


# e0c00add 10-Jul-2013 Andre Oppermann <andre@FreeBSD.org>

Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and
expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem"
through sysctl.

Reported by: smh
MFC after: 1 day


# 4591f0d3 23-May-2013 Julian Elischer <julian@FreeBSD.org>

Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends) but there are a few
users that just assume that it is a sensible value but then go on to read it.
These include SCTP, pf and the FLOWTABLE option (and maybe others).


# 2ebcc8ac 24-Apr-2013 Andre Oppermann <andre@FreeBSD.org>

Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
It is the limiting factor hence.

For architectures with a direct map using the size of kmem_map
is a good proxy of available kernel memory as well. If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by: pho's new network stress test
Pointed out by: alc (kmem_map instead of kernel_map)
Tested by: pho


# 37140716 17-Jan-2013 Andre Oppermann <andre@FreeBSD.org>

Move the mbuf memory limit calculations from init_param2() to
tunable_mbinit() where it is next to where it is used later.

Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES
to SI_SUB_KMEM after the VM is running. This allows to use better
methods to determine the effectively available physical and virtual
memory available to the kernel.

Update comments.

In a second step it can be merged into mbuf_init().


# 6e0b6746 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.

Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 45fe0bf7 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Make use of the fact that uma_zone_set_max(9) already returns actual limit set.


# 4007b61c 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

More style cleanups.


# b0b14025 07-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Style cleanups.


# 416a434c 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Complete r243631 by applying the remainder of kern_mbuf.c that got
lost while merging into the commit tree.

MFC after: 1 month
X-MFC-with: r243631


# ead46972 27-Nov-2012 Andre Oppermann <andre@FreeBSD.org>

Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower. The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used
these large clusters will end up only on the inbound path. They are
not used on outbound, there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
512MB limit for mbufs
419,430 mbufs
65,536 2K mbuf clusters
32,768 4K mbuf clusters
9,709 9K mbuf clusters
5,461 16K mbuf clusters

16GB RAM:
8GB limit for mbufs
33,554,432 mbufs
1,048,576 2K mbuf clusters
524,288 4K mbuf clusters
155,344 9K mbuf clusters
87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after: 1 month


# 79f62ed6 09-Nov-2012 Alfred Perlstein <alfred@FreeBSD.org>

Allow maxusers to scale on machines with large address space.

Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override
for the calculation on a MD basis. Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks


# a2c36a02 29-Oct-2012 Kevin Lo <kevlo@FreeBSD.org>

Since the macro dtom() has been removed, fix comments about the dtom.

Reviewed by: glebius


# 812302c3 23-Aug-2012 Navdeep Parhar <np@FreeBSD.org>

Allow nmbjumbop, nmbjumbo9, and nmbjumbo16 to be set directly via loader
tunables.

MFC after: 1 month


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 60ae52f7 21-Jun-2010 Ed Schouten <ed@FreeBSD.org>

Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.


# 3153e878 12-Jul-2009 Alan Cox <alc@FreeBSD.org>

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# e999111a 25-Jun-2009 Alan Cox <alc@FreeBSD.org>

This change is the next step in implementing the cache control functionality
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with: jhb


# 5b204a11 19-Jun-2009 Kip Macy <kmacy@FreeBSD.org>

define helper routines for deferred mbuf initialization


# c45c0034 18-Jun-2009 Alan Cox <alc@FreeBSD.org>

Utilize the new function kmem_alloc_contig() to implement the UMA back-end
allocator for the jumbo frames zones. This change has two benefits: (1) a
custom back-end deallocator is no longer required. UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames. Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 877e8812 01-Jan-2009 Robert Watson <rwatson@FreeBSD.org>

Temporary workaround for the limitations of the mbuf flowid field: zero
the field in the mbuf constructors, since otherwise we have no way to
tell if they are valid. In the future, Kip has plans to add a flag
specifically to indicate validity, which is the preferred model.


# 98bd6a19 18-Dec-2008 Ruslan Ermilov <ru@FreeBSD.org>

Removed a comment made obsolete by revisions 157927 and 174292.


# 62938659 10-Dec-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Make sure nmbclusters are initialized before maxsockets
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.

Reviewed by: rwatson, zec
Sponsored by: The FreeBSD Foundation
MFC after: 4 weeks


# 8aa7a581 08-Nov-2008 Kip Macy <kmacy@FreeBSD.org>

make kern.ipc.nmbclusters actually have a useful effect on nmbclusters et al.
initialize pkthdr in field order


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 7630c265 04-Apr-2008 Alan Cox <alc@FreeBSD.org>

Reintroduce UMA_SLAB_KMAP; however, change its spelling to
UMA_SLAB_KERNEL for consistency with its sibling UMA_SLAB_KMEM.
(UMA_SLAB_KMAP met its original demise in revision 1.30 of
vm/uma_core.c.) UMA_SLAB_KERNEL is now required by the jumbo frame
allocators. Without it, UMA cannot correctly return pages from the
jumbo frame zones to the VM system because it resets the pages' object
field to NULL instead of the kernel object. In more detail, the jumbo
frame zones are created with the option UMA_ZONE_REFCNT. This causes
UMA to overwrite the pages' object field with the address of the slab.
However, when UMA wants to release these pages, it doesn't know how to
restore the object field, so it sets it to NULL. This change teaches
UMA how to reset the object field to the kernel object.

Crashes reported by: kris
Fix tested by: kris
Fix discussed with: jeff
MFC after: 6 weeks


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# cf827063 01-Feb-2008 Poul-Henning Kamp <phk@FreeBSD.org>

Give MEXTADD() another argument to make both void pointers to the
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.

Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.

Likely break the only non-convetional user of the API, after informing
the relevant committer.

Update the mbuf(9) manual page, which was already out of sync on
this point.

Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.

This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.

This does not affect the memory layout or size of mbufs.

Parental oversight by: sam and rwatson.

No MFC is anticipated.


# 6b4959bf 15-Dec-2007 Randall Stewart <rrs@FreeBSD.org>

- fix tab to space issue, hmm maybe I should use vi.


# cf70a46b 05-Dec-2007 Randall Stewart <rrs@FreeBSD.org>

- Puts default limits on 4k/9k and 16k zones for mbufs all based
on 1/2 of each of the successive limits tied to the limit for
2k clusters.
- Adds real functionality in so that doing a sysctl to change these
actually changes them :-)

MFC after: 1 week


# ba63339a 04-Dec-2007 Alan Cox <alc@FreeBSD.org>

Introduce an UMA backend page allocator for the jumbo frame zones that
allocates physically contiguous memory.

MFC after: 3 months
Requested and reviewed by: Kip Macy
Tested by: Andrew Gallatin and Pyun YongHyeon


# ef44c8d2 26-Oct-2007 David E. O'Brien <obrien@FreeBSD.org>

style(9)


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 457869b9 06-Oct-2007 Kip Macy <kmacy@FreeBSD.org>

This patch adds an M_NOFREE flag which allows one to mark an mbuf as
not being independently freeable. This allows one to embed an mbuf in
the cluster itself. This confers the benefits of the packet zone on
all cluster sizes. Embedded mbufs currently suffer from the same
limitation that packet zone mbufs do in that one cannot disconnect
them and pass them around independently of the cluster. It would
likely be possible to eliminate this limitation in the future by
adding a second reference for the mbuf itself.

Approved by: re(gnn)


# 629b9e08 06-Oct-2007 Kip Macy <kmacy@FreeBSD.org>

Allow drivers to free an mbuf without having the mbuf be touched if
the driver has already freed any attached tags

Approved by: re(gnn)


# 041b706b 04-Jun-2007 David Malone <dwmalone@FreeBSD.org>

Despite several examples in the kernel, the third argument of
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.

Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.

In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.


# 0f4d9d04 04-Apr-2007 Kip Macy <kmacy@FreeBSD.org>

Fix mb_ctor_clust and mb_dtor_clust to reference the appropriate zone,
simplify setting refcnt

Reviewed by: andre, rwatson, and glebius
MFC after: 3 days


# 6c125b8d 24-Jan-2007 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for problems that occur when all mbuf clusters migrate to the mbuf packet
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.

Reviewed by: andre@


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# a855e2b4 17-Sep-2006 Andre Oppermann <andre@FreeBSD.org>

Remove VLAN mtag UMA zones and initialize ether_vtag and tso_segsz packet
header fields to zero on mbuf allocation.

Sponsored by: TCP/IP Optimization Fundraise 2005


# b37ffd31 10-Jun-2006 Robert Watson <rwatson@FreeBSD.org>

Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month


# 4f590175 21-Apr-2006 Paul Saab <ps@FreeBSD.org>

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# a7bd90ef 08-Mar-2006 Andre Oppermann <andre@FreeBSD.org>

Properly handle the case when the packet secondary zone can't allocate
further mbuf clusters to attach to mbufs.

Reported by: kris
Tested by: kris
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 73bb09f2 27-Feb-2006 Gleb Smirnoff <glebius@FreeBSD.org>

One more grammar nit.

Submitted by: ru


# fcf90618 26-Feb-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Fix several typos and trim spaces at eol.

PR: kern/93759
Submitted by: Antoine Brodin <antoine.brodin laposte.net>


# ec63cb90 17-Feb-2006 Andre Oppermann <andre@FreeBSD.org>

Replace the 4k fixed sized jumbo mbuf clusters with PAGE_SIZE sized
jumbo mbuf clusters. To make the variable size clear they are named
MJUMPAGESIZE.

Having jumbo clusters with the native PAGE_SIZE is more useful than
a fixed 4k size according the device driver writers using this API.

The 9k and 16k jumbo mbuf clusters remain unchanged.

Requested by: glebius, gallatin
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 days


# 75ee267c 30-Jan-2006 Gleb Smirnoff <glebius@FreeBSD.org>

Merge the //depot/user/yar/vlan branch into CVS. It contains some collective
work by yar, thompsa and myself. The checksum offloading part also involves
work done by Mihail Balikov.

The most important changes:

o Instead of global linked list of all vlan softc use a per-trunk
hash. The size of hash is dynamically adjusted, depending on
number of entries. This changes struct ifnet, replacing counter
of vlans with a pointer to trunk structure. This change is an
improvement for setups with big number of VLANs, several interfaces
and several CPUs. It is a small regression for a setup with a single
VLAN interface.
An alternative to dynamic hash is a per-trunk static array with
4096 entries, which is a compile time option - VLAN_ARRAY. In my
experiments the array is not an improvement, probably because such
a big trunk structure doesn't fit into CPU cache.
o Introduce an UMA zone for VLAN tags. Since drivers depend on it,
the zone is declared in kern_mbuf.c, not in optional vlan(4) driver.
This change is a big improvement for any setup utilizing vlan(4).
o Use rwlock(9) instead of mutex(9) for locking. We are the first
ones to do this! :)
o Some drivers can do hardware VLAN tagging + hardware checksum
offloading. Add an infrastructure for this. Whenever vlan(4) is
attached to a parent or parent configuration is changed, the flags
on vlan(4) interface are updated.

In collaboration with: yar, thompsa
In collaboration with: Mihail Balikov <mihail.balikov interbgc.com>


# fd241339 23-Jan-2006 Andre Oppermann <andre@FreeBSD.org>

In mb_zinit_pack() explicitly ignore the return value of uma_zalloc_arg().
The success of the cluster allocation is checked through a field in the
mbuf structure. This change is non-functional but properly silences code
inspection tools.

Found by: Coverity Prevent(tm)
Coverity ID: CID807
Sponsored by: TCP/IP Optimization Fundraise 2005


# 36ae3fd3 10-Dec-2005 Andre Oppermann <andre@FreeBSD.org>

Hide the 4k mbuf clusters if the normal clusters are defined to be
4k already.

This unbreaks tinderbox.

Submitted by: ru


# d5269a63 08-Dec-2005 Andre Oppermann <andre@FreeBSD.org>

Add an API for jumbo mbuf cluster allocation and also provide
4k clusters in addition to 9k and 16k ones.

struct mbuf *m_getjcl(int how, short type, int flags, int size)
void *m_cljget(struct mbuf *m, int how, int size)

m_getjcl() returns an mbuf with a cluster of the specified size attached
like m_getcl() does for 2k clusters.

m_cljget() is different from m_clget() as it can allocate clusters
without attaching them to an mbuf. In that case the return value
is the pointer to the cluster of the requested size. If an mbuf was
specified, it gets the cluster attached to it and the return value
can be safely ignored.

For size both take MCLBYTES, MJUM4BYTES, MJUM9BYTES, MJUM16BYTES.

Reviewed by: glebius
Tested by: glebius
Sponsored by: TCP/IP Optimization Fundraise 2005


# 49d46b61 06-Nov-2005 Gleb Smirnoff <glebius@FreeBSD.org>

Fix panic string in last revision.


# cd5bb63b 05-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Free only those mbuf+clusters back to the packet zone that were allocated
from there. All others get broken up and free'd individually to the mbuf
and cluster zones.

The packet zone is a secondary zone to the mbuf zone. There is currently
a limitation in UMA which prevents decreasing the packet zone stock when
the mbuf and cluster zone are drained and all their members are part of
packets. When this is fixed this change may be reverted.


# a5f77087 04-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Fix a logic error introduced with mandatory mbuf cluster refcounting and
freeing of mbufs+clusters back to the packet zone.


# 56a4e45a 02-Nov-2005 Andre Oppermann <andre@FreeBSD.org>

Mandatory mbuf cluster reference counting and groundwork for UMA
based jumbo 9k and jumbo 16k cluster support.

All mbuf's with external storage attached are mandatory reference
counted. For clusters and jumbo clusters UMA provides the refcnt
storage directly. It does not have to be separatly allocated. Any
other type of external storage gets its own refcnt allocated from
an UMA mbuf refcnt zone instead of normal kernel malloc.

The refcount API MEXT_ADD_REF() and MEXT_REM_REF() is no longer
publically accessible. The proper m_* functions have to be used.

mb_ctor_clust() and mb_dtor_clust() both handle normal 2K as well
as 9k and 16k clusters.

Clusters and jumbo clusters may be obtained without attaching it
immideatly to an mbuf. This is for high performance cluster
allocation in network drivers where mbufs are attached after the
cluster has been filled.

Tested by: rwatson
Sponsored by: TCP/IP Optimizations Fundraise 2005


# 32a6bd95 27-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

No longer maintain mbstat statistics for the mbuf allocator, UMA
statistics and libmemstat(3) are now used to track mbuf statistics.

MFC after: 1 month


# 68352adf 17-Jul-2005 Robert Watson <rwatson@FreeBSD.org>

Define four constants, MBUF_{,MEM,CLUSTER,PACKET,TAG}_MEM_NAME, which
are string names for their respective UMA zones and malloc types, and
are passed into uma_zcreate() and MALLOC_DEFINE(). Export them
outside of _KERNEL in mbuf.h so that netstat can reference them.

Change the names to improve consistency, with each zone/type
associated with the mbuf allocator being prefixed mbuf_.

MFC after: 1 week


# a7b844d2 29-Jun-2005 Mike Silbersack <silby@FreeBSD.org>

Fix the false memory modified after free messages some users have been
reporting - in my previous change, I missed the case where a mbuf
from the packet zone was freed back to the mbuf/packet keg, where
it was subsequently put into the mbuf zone and found not to contain
the expected trash. This change adds the necessary trash_dtor call inside
mb_fini_pack so that everything is correct.

Thanks for Bosko for finding the bug and showing me how secondary zones
work.

Approved by: re (dwhite)


# 121f0509 22-Jun-2005 Mike Silbersack <silby@FreeBSD.org>

Change the mbuf, mbuf cluster, and mbuf packet allocation routines so that
the UMA "trash" allocator is used - this ensures that any writes to a freed
mbuf should provoke a panic.

Only enabled under INVARIANTS, of course.

Approved by: re (scottl)


# 8076cb52 16-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Well, it seems that I pre-maturely removed the "All rights reserved"
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:

sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h


# 3d2a3ff2 10-Feb-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Optimize the way reference counting is performed with Mbufs. We
do not need to perform an extra memory fetch in the Packet (Mbuf+Cluster)
constructor to initialize the reference counter anymore. The reference
counts are located in a separate memory region (in the slab header,
because this zone is UMA_ZONE_REFCNT), so the memory fetch resulted very
often in a cache miss. Additionally, and perhaps more significantly,
optimize the free mbuf+cluster (packet) case, which is very common, to
no longer require an atomic operation on free (to verify the reference
counter) if the reference on the cluster has never been increased (also
very common). Reduces an atomic on mbuf free on average.

Original patch submitted by: Gerrit Nagelhout <gnagelhout@sandvine.com>


# 737cd952 31-Jan-2005 Bosko Milekic <bmilekic@FreeBSD.org>

Update copyright, remove "all rights reserved" (since they are not
all reserved, as the lisence makes clear), and strike the third clause
(now this is a 2-clause liberal BSDL as are the rest of files I hold
copyright over).


# a04946cf 20-Sep-2004 Brian Somers <brian@FreeBSD.org>

CTASSERT that MSZIE is a power of 2 (otherwise dtom() breaks)
Ask uma_zcreate() to align mbufs to MSIZE bytes (otherwise dtom() breaks)

As it happens, uma_zalloc_arg() always returned mbufs aligned to MSIZE
anyway, but that was an implementation side-effect....

KASSERT -> CTASSERT suggested by: dd@
Approved by: silence on -net


# b23f72e9 01-Aug-2004 Brian Feldman <green@FreeBSD.org>

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 6bc72ab9 01-Jun-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Fix a couple of bugs in the mbuf and packet ctors. In the latter case,
nextpkt within the m_hdr was not being initialized to NULL for
!M_PKTHDR cases. *Maybe* this will fix weird socket buffer
inconsistency panics, but we'll see.


# 099a0e58 31-May-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)