History log of /freebsd-current/sys/dev/nvme/nvme_private.h
Revision Date Author Comments
# d09ee08f 24-May-2024 Warner Losh <imp@FreeBSD.org>

nvme: Count number of alignment splits

When possible, we split up I/Os to NVMe drives that advertise a
preferred alignment. Add a counter for this.

Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D45311
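
As a rough illustration of what such a counter measures, the sketch below counts how many child I/Os a request needs so that none crosses a multiple of the preferred alignment. The structure and function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical statistics holder; not the driver's actual structure. */
struct split_stats {
	uint64_t alignment_splits;
};

/*
 * Return the number of child I/Os needed so that no child crosses a
 * multiple of `boundary` bytes. The counter is bumped once whenever the
 * request has to be split at all.
 */
static inline uint32_t
count_children(uint64_t offset, uint64_t length, uint64_t boundary,
    struct split_stats *st)
{
	uint64_t first, last;

	assert(boundary != 0 && length != 0);
	first = offset / boundary;
	last = (offset + length - 1) / boundary;
	if (last != first)
		st->alignment_splits++;
	return ((uint32_t)(last - first + 1));
}
```

With a 128 KiB boundary, an 8 KiB request starting 4 KiB before the boundary needs two children and increments the counter, while an aligned 4 KiB request does neither.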


# 1931b75e 22-Mar-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Export constants for min and max queue sizes

These are useful for NVMe over Fabrics.

Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44441


# 8d6c0743 06-Nov-2023 Alexander Motin <mav@FreeBSD.org>

nvme: Introduce longer timeouts for admin queue

KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty
namespace. In some cases like hot-plug it takes longer, triggering
timeout and controller resets after just 30 seconds. Linux has had a
separate 60 second timeout for the admin queue for many years. This patch
does the same. And it is good to be consistent.

Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454


# 9cd7b624 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: Eliminate RECOVERY_FAILED state

While it seemed like a good idea to have this state, we can do
everything we wanted with the state by checking ctrlr->is_failed since
that's set before we start failing the qpairs. Add some comments about
racing when we're failing the controller, though in practice I'm not
sure that kind of race could even be lost.

Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42051


# bc85cd30 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: gc nvme_ctrlr_post_failed_request and related task stuff

In 4b977e6dda92 we removed the call to nvme_ctrlr_post_failed_request
because we can now directly fail requests in this context since we're in
the reset task already. No need to queue it. I left it in place against
future need, but it's been two years and no panics have resulted. Since
the static analysis (code checking) and the dynamic analysis (surviving
in the field for 2 years, including at $WORK where we know we've gone
through this path when we've failed drives) both signal that it's not
really needed, go ahead and GC it. If we discover at a later date a flaw
in this analysis, we can add it back easily enough by reverting this and
4b977e6dda92.

Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42048


# da8324a9 24-Sep-2023 Warner Losh <imp@FreeBSD.org>

nvme: Fix locking protocol violation to fix suspend / resume

Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.

Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.

Also, tidy up the failure case as well: failing a queue is a superset of
disabling it, so no need to call disable first. This solves some
locking issues with recursion since we don't need to recurse. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.

Next, kill the pre/post reset stuff. It's completely bogus since we
disable the qpairs, we don't need to also hold the lock through the
reset: disabling will cause the ISR to return early. This keeps us from
recursing on the recovery lock when resuming. We only need the recovery
lock to avoid a specific race between the timer and the ISR.

Finally, kill NVME_RESET_2X. It's been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.

These are all interrelated, so need to be done at the same time.

Sponsored by: Netflix
Reviewed by: jhb
Tested by: jhb (made sure suspend / resume worked)
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D41866


# 8052b01e 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add exclusion for ISR

Add a basically uncontended spinlock that we take out while the ISR is
running. This has two effects: First, when we get a timeout, we can
safely call the nvme_qpair_process_completions w/o racing any ISRs.
Second, we can use it to ensure that we don't reset the card while
the ISRs are active (right now we just sleep and hope for the best,
which usually is fine, but not always).

Sponsored by: Netflix
MFC After: 2 weeks
Reviewed by: chuck, gallatin
Differential Revision: https://reviews.freebsd.org/D41452


# d4959bfc 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Greatly improve error recovery

Next phase of error recovery: Eliminate the RECOVERY_START phase, since
we don't need to wait to start recovery. Eliminate the RECOVERY_RESET
phase since it is transient, we now transition from RECOVERY_NORMAL into
RECOVERY_WAITING.

In normal mode, read the status of the controller. If it is in failed
state, or appears to be hot-plugged, jump directly to reset which will
sort out the proper things to do. This will cause all pending I/O to
complete with an abort status before the reset.

When in the NORMAL state, call the interrupt handler. This will complete
all pending transactions when interrupts are broken or temporarily
misbehaving. We then check all the pending completions for timeouts. If
we have abort enabled, then we'll send an abort. Otherwise we'll assume
the controller is wedged and needs a reset. By calling the interrupt
handler here, we'll avoid an issue with the current code where we
transitioned to RECOVERY_START which prevented any completions from
happening. Now completions happen. In addition, any follow-on I/O that is
scheduled in the completion routines will be submitted, rather than
queued, because the recovery state is correct. This also fixes a problem
where I/O would timeout, but never complete, leading to hung I/O.

Resetting remains the same as before; only when we choose to reset has
changed.

A nice side effect of these changes is that we now do I/O when
interrupts to the card are totally broken. Follow-on commits will improve
the error reporting and logging when this happens. Performance will be
awful, but will at least be minimally functional.

There is a small race when we're checking the completions if interrupts
are working, but this is handled in a future commit.

Sponsored by: Netflix
MFC After: 2 weeks
Differential Revision: https://reviews.freebsd.org/D36922


# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 33469f10 14-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: use mtx_padalign instead of mtx + alignment attribute

nvme driver predates, it seems, mtx_padalign. Modernize.

Sponsored by: Netflix


# 09c20a29 08-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Move bools to fill hole

The two bools in nvme_request create a 6 byte hole today. Move them to
after retries to fill the 4 byte hole there and add a spare[2] to make
nvme_request 8 bytes smaller. spare[2] isn't strictly necessary, but
documents how many bytes we have left in that hole, as the number of
booleans will increase shortly.

Suggested by: chuck
Sponsored by: Netflix


# 7be0b068 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate command printing routine

Both nvme_dump_command and nvme_qpair_print_command print nvme
commands. The latter is better. Recode the one call to
nvme_dump_command to use nvme_qpair_print_command and delete the
former. No sense having two nearly identical routines. A future commit
will convert to sbuf.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41309


# 6f76d493 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate completion printing routine

Both nvme_dump_completion and nvme_qpair_print_completion print
completions. The latter is better. Recode the two instances of
nvme_dump_completion to use nvme_qpair_print_completion and delete the
former. No sense having two nearly identical routines. A future commit
will convert this to sbuf.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D41308


# 92103adb 24-Jul-2023 John Baldwin <jhb@FreeBSD.org>

nvme: Use a memdesc for the request buffer instead of a bespoke union.

This avoids encoding CAM-specific knowledge in nvme_qpair.c.

Reviewed by: chuck, imp, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41119


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 1093caa1 06-May-2022 John Baldwin <jhb@FreeBSD.org>

nvme: Remove unused devclass arguments to DRIVER_MODULE.


# 3a468f20 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Use saved mps when initializing drive

Make sure we set the MPS we cached (currently the drive's minimum mps) in
CC (Controller Configuration) when reinitializing the drive. It must
match the page_size that we're going to use. Also retire less specific
NVME_PAGE_SHIFT since it's now unused.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34869


# 55412ef9 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Rename min_page_size to page_size and save mps

The Memory Page Size sets the basic unit of operation for the drive. We
currently set this to the drive's minimum page size, but we could set it
to any page size the drive supports in the future. Replace min_page_size
(it's now unused for that purpose) with page_size to reflect this and
cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34868


# 6af6a52e 29-Mar-2022 Warner Losh <imp@FreeBSD.org>

nvme: Save cap_lo and cap_hi

Save the capabilities for the drive.

Sponsored by: Netflix


# a70b5660 29-Mar-2022 Warner Losh <imp@FreeBSD.org>

nvme: MPS is a power of two, not a size / 8k

Setting MPS in the CC should be a power of 2 number (it specifies the
page size of the host is 2^(12+MPS)), so adjust the calculation. There is
no functional change because we do not support any architectures != 4k
pages (yet). Other changes are needed for architectures with 16k or 64k
pages, especially when the underlying NVMe drive doesn't support that
page size (Most drives support a range that's small, and many only
support 4k), but let's at least do this calculation correctly. 12 - 12
is just as much 0 as 4096 >> 13 is :)

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D34707
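
The relationship described above (page size = 2^(12+MPS)) can be sketched as a pair of helpers. The function names are hypothetical, and the value 12 for NVME_MPS_SHIFT is assumed here from the formula in the commit message rather than copied from the header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: CC.MPS of 0 selects a 2^12 = 4096 byte page. */
#define NVME_MPS_SHIFT	12

/* Return CC.MPS for a power-of-two host page size of at least 4 KiB. */
static inline uint32_t
page_size_to_mps(uint32_t page_size)
{
	uint32_t shift = 0;

	assert(page_size >= 4096 && (page_size & (page_size - 1)) == 0);
	while ((1u << shift) < page_size)
		shift++;
	return (shift - NVME_MPS_SHIFT);
}

/* Inverse mapping: the page size that a given CC.MPS value selects. */
static inline uint32_t
mps_to_page_size(uint32_t mps)
{
	return (1u << (NVME_MPS_SHIFT + mps));
}
```

For 4 KiB pages both the old expression (4096 >> 13) and the log2-based one give 0, which is why nothing changed functionally. For 64 KiB pages the old expression would give 8, whereas the log2-based value is 4.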


# 7cf8d63c 06-Dec-2021 Warner Losh <imp@FreeBSD.org>

nvme_ahci: Mark AHCI devices as such in the controller

Add a quirk to flag AHCI attachment to the controller. This is for any
of the strategies for attaching nvme devices as children of the AHCI
device for Intel's RAID devices. This also has a side effect of cleaning
up resource allocation from failed nvme_attach calls now.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D33285


# 053f8ed6 06-Dec-2021 Warner Losh <imp@FreeBSD.org>

nvme: Move to a quirk for the Intel alignment data

Prior to NVMe 1.3, Intel produced a series of drives that had
performance alignment data in the vendor specific space since no
standard had been defined. Move testing the versions to a quirk so the
NVMe NS code doesn't know about PCI device info.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D33284


# 83581511 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use adaptive spinning when polling for completion or state change

We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These commands take about 20-200 microseconds
each. Set the wait time to 1us and then increase it by a factor of 1.5
each successive iteration (max 1ms). This reduces initialization time by
80ms in cperciva's tests.

Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.

Tested by: cperciva
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32259
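
A minimal sketch of the backoff schedule described above, assuming the delay is multiplied by 1.5 per iteration and tracked in whole microseconds. The names and the integer rounding are illustrative assumptions; the driver itself may use a finer-grained time type.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_DELAY_MIN_US	1	/* first wait: 1us */
#define POLL_DELAY_MAX_US	1000	/* cap: 1ms */

/* Grow the delay by a factor of 1.5 using integer math, capped at 1ms. */
static inline uint32_t
next_poll_delay_us(uint32_t delay_us)
{
	uint32_t next = delay_us * 3 / 2;

	if (next == delay_us)	/* 1 * 3 / 2 truncates back to 1 */
		next = delay_us + 1;
	return (next > POLL_DELAY_MAX_US ? POLL_DELAY_MAX_US : next);
}

/* Total time slept after `iterations` waits, starting at the minimum. */
static inline uint64_t
total_wait_us(unsigned iterations)
{
	uint64_t total = 0;
	uint32_t delay = POLL_DELAY_MIN_US;

	while (iterations-- > 0) {
		total += delay;
		delay = next_poll_delay_us(delay);
	}
	return (total);
}
```

Short waits come first, so a command that finishes in tens of microseconds is noticed almost immediately, while a slow one never costs more than 1ms per check.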


# 587aa255 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: count number of ignored interrupts

Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.

Sponsored by: Netflix
Reviewed by: mav, chuck
Differential Revision: https://reviews.freebsd.org/D32212


# 502dc84a 23-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use shared timeout rather than timeout per transaction

Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have entered timeout. If so, quiesce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28583


# e3bdf3da 31-Aug-2021 Alexander Motin <mav@FreeBSD.org>

nvme(4): Add MSI and single MSI-X support.

If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after: 1 week
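
The fallback order above can be summarized as a small decision function. This is only a sketch: the enum, the struct, and the idea of passing in "vectors available" counts are hypothetical simplifications of what a real bus allocation call would report.

```c
#include <assert.h>

enum intr_mode {
	INTR_MSIX,		/* multiple MSI-X vectors */
	INTR_MSIX_SHARED,	/* single shared MSI-X vector */
	INTR_MSI,		/* two MSI vectors */
	INTR_MSI_SHARED,	/* single shared MSI vector */
	INTR_INTX		/* shared legacy INTx */
};

struct intr_choice {
	enum intr_mode	mode;
	int		vectors;
};

/* msix_avail / msi_avail: vectors the bus could grant (hypothetical inputs). */
static inline struct intr_choice
choose_interrupts(int msix_avail, int msi_avail)
{
	if (msix_avail >= 2)
		return ((struct intr_choice){ INTR_MSIX, msix_avail });
	if (msix_avail == 1)
		return ((struct intr_choice){ INTR_MSIX_SHARED, 1 });
	if (msi_avail >= 2)
		return ((struct intr_choice){ INTR_MSI, 2 });
	if (msi_avail == 1)
		return ((struct intr_choice){ INTR_MSI_SHARED, 1 });
	return ((struct intr_choice){ INTR_INTX, 1 });
}
```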


# dd2516fc 08-Feb-2021 Warner Losh <imp@FreeBSD.org>

nvme: Make nvme_ctrlr_hw_reset static

nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so
make it static. If we need to change this in the future we can.


# 9600aa31 08-Feb-2021 Warner Losh <imp@FreeBSD.org>

nvme: use NVME_GONE rather than hard-coded 0xffffffff

Make it clearer that the value 0xffffffff is being used to detect the device is
gone. We use it in other places in the driver for other meanings.


# ac90f70d 28-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Increase nvme(4) maximum transfer size from 1MB to 2MB.

With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs. The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
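
The 2MB figure follows from simple arithmetic: one page of 8-byte PRP entries holds page_size / 8 entries, and each entry maps one page. The sketch below works that out and clamps it by the controller's MDTS field, which the NVMe specification defines as a power-of-two multiple of the minimum page size with 0 meaning no limit. Function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest transfer addressable through a single page of PRP entries. */
static inline uint64_t
prp_list_max_xfer(uint64_t page_size)
{
	return ((page_size / sizeof(uint64_t)) * page_size);
}

/*
 * Clamp by the controller limit: max = min_page_size << mdts,
 * where mdts == 0 means the controller reports no limit.
 */
static inline uint64_t
effective_max_xfer(uint64_t page_size, uint8_t mdts, uint64_t min_page_size)
{
	uint64_t limit = prp_list_max_xfer(page_size);
	uint64_t ctrlr;

	if (mdts != 0) {
		ctrlr = min_page_size << mdts;
		if (ctrlr < limit)
			limit = ctrlr;
	}
	return (limit);
}
```

With 4KB pages the first helper yields 512 * 4096 = 2MB, and a controller reporting MDTS = 5 would bring the effective limit down to 128KB.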


# 91387707 23-Nov-2020 Michal Meloun <mmel@FreeBSD.org>

Ensure that the buffer is in nvme_single_map() mapped to single segment.
Not a functional change.

MFC after: 1 week


# 71460dfc 05-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

nvme: change nvme_request_zone into a malloc type

Both the size (128 bytes) and ephemeral nature of allocations make it a great
fit for malloc.

A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects.

Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D27103


# d87b31e1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

nvme: clean up empty lines in .c and .h files


# ead7e103 18-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Make polled request timeout less invasive.

Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.

In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 67abaee9 07-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Add Host Memory Buffer support to nvme(4).

This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance. Device reports minimal and preferable
size of the buffer. The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.

MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
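
The sizing policy described above might be sketched as follows. The function name is hypothetical, the 1% cap is hard-coded for illustration, and a return value of 0 stands for "run without a host memory buffer".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a Host Memory Buffer size: use the preferred size, cap it at 1%
 * of physical RAM, and give up (return 0) if the result is below the
 * minimum the device says it needs.
 */
static inline uint64_t
hmb_buffer_size(uint64_t pref, uint64_t min, uint64_t physmem)
{
	uint64_t cap = physmem / 100;
	uint64_t size = pref < cap ? pref : cap;

	if (size == 0 || size < min)
		return (0);
	return (size);
}
```

A failed allocation would be handled the same way as a too-small result: the device simply works without the buffer.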


# 7588c6cc 13-Dec-2019 Warner Losh <imp@FreeBSD.org>

Move to using bool instead of boolean_t

While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999


# 1eab19cb 23-Sep-2019 Alexander Motin <mav@FreeBSD.org>

Make nvme(4) driver some more NUMA aware.

- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after: 1 month
Sponsored by: iXsystems, Inc.


# f93b7f95 04-Sep-2019 Warner Losh <imp@FreeBSD.org>

Support doorbell strides != 0.

The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
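
For reference, the NVMe specification places the submission queue y tail doorbell at 0x1000 + (2y) * (4 << CAP.DSTRD) and the completion queue y head doorbell at 0x1000 + (2y + 1) * (4 << CAP.DSTRD). A sketch of precomputing both offsets once per queue pair, so the fast path only loads a stored value, could look like this. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DOORBELL_BASE	0x1000u

/* Per-qpair doorbell offsets, computed once at queue setup time. */
struct qpair_doorbells {
	uint32_t sq_tdbl_off;	/* submission queue tail doorbell */
	uint32_t cq_hdbl_off;	/* completion queue head doorbell */
};

static inline struct qpair_doorbells
doorbell_offsets(uint32_t qid, uint32_t dstrd)
{
	uint32_t stride = 4u << dstrd;	/* DSTRD 0 -> 4 bytes, 1 -> 8 bytes */
	struct qpair_doorbells db;

	db.sq_tdbl_off = NVME_DOORBELL_BASE + (2 * qid) * stride;
	db.cq_hdbl_off = NVME_DOORBELL_BASE + (2 * qid + 1) * stride;
	return (db);
}
```

With a stride of 1, queue pair 1 uses offsets 0x1010 and 0x1018 instead of 0x1008 and 0x100c.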


# 4d547561 03-Sep-2019 Warner Losh <imp@FreeBSD.org>

Implement nvme suspend / resume for pci attachment

When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutdown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.

On resume, we have to reset the card twice, for reasons described in
the attach function. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.

Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.

Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.

Split parts of the hardware reset function up so that I can
do part of the reset in suspend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.

Finally, fix a couple of spelling errors in comments related to
this.

Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493


# 31b11bb3 02-Sep-2019 Warner Losh <imp@FreeBSD.org>

In nvme_completion_poll, add a sanity check to make sure that we complete the
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.


# ab0681aa 02-Sep-2019 Warner Losh <imp@FreeBSD.org>

In all the places that we use the polled for completion interface, except crash
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler chooses to not inline, any performance
hit is tiny.


# f182f928 21-Aug-2019 Warner Losh <imp@FreeBSD.org>

Separate the pci attachment from the rest of nvme

Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.

Submitted by: cognet@


# 97be8b96 14-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Report NOIOB and NPWG fields as stripe size.

Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace
Preferred Write Granularity added in 1.4 allow upper layers to align
I/Os for improved SSD performance and endurance.

I don't have hardware reporting those yet, but NPWG could probably be
reported by bhyve.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 5e83c2ff 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of commands that exhaust their retry limit.

While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.


# c37fc318 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of retried commands.

Retried commands can indicate a performance degradation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands and interrupts.


# 1071b50a 18-Jul-2019 Warner Losh <imp@FreeBSD.org>

Use sysctl + CTLRWTUN for hw.nvme.verbose_cmd_dump.

Also convert it to a bool. While the rest of the driver isn't yet bool clean,
this will help.

Reviewed by: cem@
Differential Revision: https://reviews.freebsd.org/D20988


# c75bdc04 18-Jul-2019 Warner Losh <imp@FreeBSD.org>

Provide new tunable hw.nvme.verbose_cmd_dump

The nvme driver dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.

Reviewed by: rpokala
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20988


# 2ffd6fce 08-Mar-2019 Warner Losh <imp@FreeBSD.org>

Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431


# 45d7e233 27-Feb-2019 Warner Losh <imp@FreeBSD.org>

Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.


# d706306d 27-Feb-2019 Warner Losh <imp@FreeBSD.org>

Remove #ifdef code to support FreeBSD versions that haven't been
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.


# 09efa3df 26-Oct-2018 Warner Losh <imp@FreeBSD.org>

Put a workaround in for command timeout malfunctioning

At least one NVMe drive has a bug that makes the Command Time Out
PCIe feature unreliable. The workaround is to disable this
feature. The driver wouldn't deal correctly with a timeout anyway.
Only do this for drives that are known bad.

Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17708


# f439e3a4 24-May-2018 Alexander Motin <mav@FreeBSD.org>

Refactor NVMe CAM integration.

- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.

Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.


# d85d9648 15-Mar-2018 Warner Losh <imp@FreeBSD.org>

Try polling the qpairs on timeout.

On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
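
The timeout handling order described above can be written down as a pure decision function. This is a sketch only: the enum and the boolean inputs are hypothetical stand-ins for the controller status read and the result of polling the completion queue.

```c
#include <assert.h>
#include <stdbool.h>

enum timeout_action {
	TO_COMPLETED,	/* polling found the completion; nothing else to do */
	TO_ABORT,	/* send an abort for the timed-out command */
	TO_RESET	/* reset the controller */
};

/*
 * fatal_status:      controller reports a fatal status
 * completions_found: polling the completion queue completed pending I/O
 * aborts_enabled:    the driver is configured to send abort commands
 */
static inline enum timeout_action
decide_timeout_action(bool fatal_status, bool completions_found,
    bool aborts_enabled)
{
	if (!fatal_status && completions_found)
		return (TO_COMPLETED);
	if (!fatal_status && aborts_enabled)
		return (TO_ABORT);
	return (TO_RESET);
}
```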


# 0d787e9b 22-Feb-2018 Wojciech Macek <wma@FreeBSD.org>

NVMe: Add big-endian support

Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
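
To illustrate the shift-and-mask style, the sketch below reads and writes the MPS field of the Controller Configuration register (bits 10:7 per the NVMe specification) and decodes a little-endian 32-bit value byte by byte, so the result is the same on any host. All macro and function names carry an EX_/ex_ prefix to mark them as illustrative, not the driver's own.

```c
#include <assert.h>
#include <stdint.h>

/* CC.MPS occupies bits 10:7; names here are illustrative only. */
#define EX_CC_MPS_SHIFT	7
#define EX_CC_MPS_MASK	0xFu

/* Decode a little-endian 32-bit value independent of host byte order. */
static inline uint32_t
ex_le32dec(const uint8_t *p)
{
	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/* Extract the field with a shift and a mask instead of a C bitfield. */
static inline uint32_t
ex_cc_get_mps(uint32_t cc)
{
	return ((cc >> EX_CC_MPS_SHIFT) & EX_CC_MPS_MASK);
}

/* Replace the field, leaving every other bit of the register untouched. */
static inline uint32_t
ex_cc_set_mps(uint32_t cc, uint32_t mps)
{
	cc &= ~(EX_CC_MPS_MASK << EX_CC_MPS_SHIFT);
	cc |= (mps & EX_CC_MPS_MASK) << EX_CC_MPS_SHIFT;
	return (cc);
}
```

Unlike bitfields, whose layout is left to the compiler and ABI, the shift and mask pin the field to fixed bit positions.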


# 29077eb4 28-Jan-2018 Warner Losh <imp@FreeBSD.org>

Use atomic load and stores to ensure that the compiler doesn't
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensures the prior writes to status are visible. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
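
The release/acquire pairing described above can be illustrated in userland with C11 atomics. This is a stand-in for the kernel's own atomic API, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical polled-completion status; `done` is an int, not a bool. */
struct poll_status {
	uint32_t	result;
	atomic_int	done;
};

static inline void
poll_status_init(struct poll_status *s)
{
	s->result = 0;
	atomic_init(&s->done, 0);
}

/* Completion side: publish the result, then set done with release order. */
static inline void
poll_status_complete(struct poll_status *s, uint32_t result)
{
	s->result = result;
	atomic_store_explicit(&s->done, 1, memory_order_release);
}

/* Polling side: the acquire load orders the later read of `result`. */
static inline int
poll_status_is_done(struct poll_status *s)
{
	return (atomic_load_explicit(&s->done, memory_order_acquire));
}
```

Because the load is atomic, the compiler cannot hoist it out of a polling loop, and the release store makes a separate write barrier unnecessary.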


# ce1ec9c1 18-Dec-2017 Warner Losh <imp@FreeBSD.org>

When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Since this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
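
One way to read the CC.EN / CSTS.RDY rule above is as a four-way plan based on the current EN bit, the current RDY bit, and the desired state. The sketch below is an interpretation of the commit text, not the driver's code, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum en_plan {
	PLAN_DONE,		/* EN == RDY == desired: short circuit */
	PLAN_WAIT_ONLY,		/* EN already set as desired; wait for RDY */
	PLAN_WRITE_THEN_WAIT,	/* prior state settled; flip EN, then wait */
	PLAN_WAIT_THEN_WRITE	/* prior transition pending; wait, then flip */
};

static inline enum en_plan
plan_en_change(bool en, bool rdy, bool want)
{
	if (en == want)
		return (rdy == want ? PLAN_DONE : PLAN_WAIT_ONLY);
	/*
	 * EN must change. Flipping it while RDY has not caught up with
	 * the current EN value is the "undefined results" case.
	 */
	return (rdy == en ? PLAN_WRITE_THEN_WAIT : PLAN_WAIT_THEN_WRITE);
}
```

For example, asking to disable a controller with EN = 1 and RDY = 0 yields PLAN_WAIT_THEN_WRITE: wait for RDY to reach 1 before clearing EN.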


# 718cf2cc 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/dev: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# bb1c7be4 15-Oct-2017 Warner Losh <imp@FreeBSD.org>

Create general polling function for the nvme controller. Use it when
we're doing the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@


# 51977281 29-Aug-2017 Warner Losh <imp@FreeBSD.org>

Add CAM/NVMe support for CAM_DATA_SG

This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@


# c02565f9 28-Aug-2017 Warner Losh <imp@FreeBSD.org>

Set the max transactions for NVMe drives better.

Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisited when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix
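
As a worked example of the estimate above, the following sketch computes
queues * trackers / 4. The function and parameter names are hypothetical,
and it assumes "number of trackers" means trackers per queue, which the
message does not state explicitly.

```c
#include <stdint.h>

/*
 * Estimate how many transactions may be pending at once:
 * number of queues * number of trackers / 4.
 * Assumption: num_trackers is the per-queue tracker count.
 */
static uint32_t
max_pending_transactions(uint32_t num_io_queues, uint32_t num_trackers)
{
	return (num_io_queues * num_trackers / 4);
}
```

With 8 queues and 128 trackers each, this yields 256 transactions before
back pressure would be applied.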


# 696c9502 25-Aug-2017 Warner Losh <imp@FreeBSD.org>

NVME Namespace ID is 32-bits, so widen interface to reflect that.

Sponsored by: Netflix


# a965389b 07-Nov-2016 Scott Long <scottl@FreeBSD.org>

Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safety belts and error handling in related codepaths.

Reviewed by: jimharris
Obtained from: Netflix
Differential Revision: D8453


# 3a31c31c 20-Jul-2016 Warner Losh <imp@FreeBSD.org>

Actually import nvme_sim so the CAM attachment for NVME (nda) actually
works.

MFC after: 1 week


# f24c011b 10-Jun-2016 Warner Losh <imp@FreeBSD.org>

Commit the bits of nda that were missed. This should fix the build.

Approved by: re@


# 2b647da7 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: do not revert to single I/O queue when per-CPU queues not possible

Previously nvme(4) would revert to a single I/O queue if it could not
allocate enough interrupt vectors or NVMe submission/completion queues
to have one I/O queue per core. This patch determines how to utilize a
smaller number of available interrupt vectors, and assigns (as closely
as possible) an equal number of cores to each associated I/O queue.

MFC after: 3 days
Sponsored by: Intel
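
One way to spread cores over a smaller number of queues as evenly as
possible is integer scaling of the CPU index. The sketch below illustrates
that idea only; queue_for_cpu() and its parameters are hypothetical, and the
message does not say this is the exact formula the driver uses.

```c
#include <stdint.h>

/*
 * Map a CPU index to an I/O queue index so that each queue serves a
 * nearly equal number of CPUs.  Assumes num_cpus > 0, cpu < num_cpus,
 * and 0 < num_queues <= num_cpus.
 */
static uint32_t
queue_for_cpu(uint32_t cpu, uint32_t num_cpus, uint32_t num_queues)
{
	return ((uint32_t)((uint64_t)cpu * num_queues / num_cpus));
}
```

For 8 CPUs and 3 queues this assigns CPUs 0-2 to queue 0, CPUs 3-5 to
queue 1, and CPUs 6-7 to queue 2.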


# e5af5854 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: do not pre-allocate MSI-X IRQ resources

The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after: 3 days
Sponsored by: Intel


# c75ad8ce 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: remove per_cpu_io_queues from struct nvme_controller

Instead just use num_io_queues to make this determination.

This prepares for some future changes enabling use of multiple
queues when we do not have enough queues or MSI-X vectors
for one queue per CPU.

MFC after: 3 days
Sponsored by: Intel


# 36b0e4ee 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: remove CHATHAM related code

Chatham was an internal NVMe prototype board used for
early driver development.

MFC after: 1 week
Sponsored by: Intel


# a6e30963 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: create separate DMA tag for non-payload DMA buffers

Submission and completion queue memory need to use a
separate DMA tag for mappings than payload buffers,
to ensure mappings remain contiguous even with DMAR
enabled.

Submitted by: kib
MFC after: 1 week
Sponsored by: Intel


# f42ca756 18-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by: Intel
MFC after: 3 days


# 496a2752 18-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Close hole where nvd(4) would not be notified of all nvme(4)
instances if modules loaded during boot.

Sponsored by: Intel
MFC after: 3 days


# bb2f67fd 08-Oct-2013 Jim Harris <jimharris@FreeBSD.org>

Log and then disable asynchronous notification of persistent events after
they occur.

This prevents repeated notifications of the same event.

Status of these events may be viewed at any time by viewing the
SMART/Health Info Page using nvmecontrol, whether or not asynchronous
events notifications for those events are enabled. This log page can
be viewed using:

nvmecontrol logpage -p 2 <ctrlr id>

Future enhancements may re-enable these notifications on a periodic basis
so that if the notified condition persists, it will continue to be logged.

Sponsored by: Intel
Reviewed by: carl
Approved by: re (hrs)
MFC after: 1 week


# a40e72a6 08-Oct-2013 Jim Harris <jimharris@FreeBSD.org>

Add driver-assisted striping for upcoming Intel NVMe controllers that can
benefit from it.

Sponsored by: Intel
Reviewed by: kib (earlier version), carl
Approved by: re (hrs)
MFC after: 1 week


# 56183abc 13-Aug-2013 Jim Harris <jimharris@FreeBSD.org>

Send a shutdown notification in the driver unload path, to ensure
notification gets sent in cases where system shuts down with driver
unloaded.

Sponsored by: Intel
Reviewed by: carl
MFC after: 3 days


# bd6b0ac5 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Add comment explaining why CACHE_LINE_SIZE is defined in nvme_private.h
if not already defined elsewhere.

Requested by: attilio
MFC after: 3 days


# e9efbc13 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Update copyright dates.

MFC after: 3 days


# bbd412dd 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Remove remaining uio-related code.

The nvme_physio() function was removed quite a while ago, which was the
only user of this uio-related code.

Sponsored by: Intel
MFC after: 3 days


# 8d09e3c4 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Use MAXPHYS to specify the maximum I/O size for nvme(4).

Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed. The larger I/O size is very
beneficial for upcoming firmware download support. This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.

Sponsored by: Intel
MFC after: 3 days


# ca269f32 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Move the busdma mapping functions to nvme_qpair.c.

This removes nvme_uio.c completely.

Sponsored by: Intel


# 97fafe25 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Add a mutex to each namespace, for general locking operations on the namespace.

Sponsored by: Intel


# a90b8104 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Rename the controller's fail_req_lock, so that it can be used for other
locking operations on the controller.

Sponsored by: Intel


# 5fdf9c3c 01-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Add unmapped bio support to nvme(4) and nvd(4).

Sponsored by: Intel


# 1e526bc4 29-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or
NULL. This simplifies decisions around if/how requests are routed through
busdma. It also paves the way for supporting unmapped bios.

Sponsored by: Intel


# 547d523e 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Clean up debug prints.

1) Consistently use device_printf.
2) Make dump_completion and dump_command into something more
human-readable.

Sponsored by: Intel
Reviewed by: carl


# dd433dd0 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Move common code from the different nvme_allocate_request functions into a
separate function.

Sponsored by: Intel
Suggested by: carl
Reviewed by: carl


# 955910a9 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Replace usages of mtx_pool_find used for admin commands with a polling
mechanism.

Now that all requests are timed, we are guaranteed to get a completion
notification, even if it is an abort status due to a timed out admin
command.

This has the effect of simplifying the controller and namespace setup
code, so that it reads straight through rather than broken up into
a bunch of different callback functions.

Sponsored by: Intel
Reviewed by: carl


# 232e2edb 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add the ability to internally mark a controller as failed, if it is unable to
start or reset. Also add a notifier for NVMe consumers for controller fail
conditions and plumb this notifier for nvd(4) to destroy the associated
GEOM disks when a failure occurs.

This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state. To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.

Sponsored by: Intel
Reviewed by: carl


# be34f216 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Remove the is_started flag from struct nvme_controller.

This flag was originally added to communicate to the sysctl code
which oids should be built, but there are easier ways to do this. This
needs to be cleaned up prior to adding new controller states - for example,
controller failure.

Sponsored by: Intel
Reviewed by: carl


# 02e33484 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Ensure the controller's MDTS is accounted for in max_xfer_size.

The controller's IDENTIFY data contains MDTS (Max Data Transfer Size) to
allow the controller to specify the maximum I/O data transfer size. nvme(4)
already provides a default maximum, but make sure it does not exceed what
MDTS reports.

Sponsored by: Intel
Reviewed by: carl
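
A minimal sketch of the cap described above. In the NVMe specification,
MDTS is a power-of-two multiplier in units of the controller's minimum
memory page size, and 0 means no limit is reported. The function and
parameter names here are hypothetical, not the driver's.

```c
#include <stdint.h>

/*
 * Return the effective maximum transfer size in bytes: the driver's
 * default, lowered to (min_page_size << mdts) when MDTS reports a
 * smaller limit.  mdts == 0 means the controller reports no limit.
 */
static uint32_t
cap_max_xfer_size(uint32_t default_max, uint8_t mdts, uint32_t min_page_size)
{
	uint64_t limit;

	if (mdts == 0)
		return (default_max);
	if (mdts >= 32)		/* far above any 32-bit default */
		return (default_max);
	limit = (uint64_t)min_page_size << mdts;
	return (limit < default_max ? (uint32_t)limit : default_max);
}
```

For example, with a 128 KiB default, 4 KiB pages and MDTS = 3, the limit
becomes 32 KiB.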


# cb5b7c13 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Cap the number of retry attempts to a configurable number. This ensures
that if a specific I/O repeatedly times out, we don't retry it indefinitely.

The default number of retries is 4, but can be adjusted using hw.nvme.retry_count.

Sponsored by: Intel
Reviewed by: carl


# 0d7e13ec 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Pass associated log page data to async event consumers, if requested.

Sponsored by: Intel
Reviewed by: carl


# 2868353a 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

When an asynchronous event request is completed, automatically fetch the
specified log page.

This satisfies the spec condition that future async events of the same type
will not be sent until the associated log page is fetched.

Sponsored by: Intel
Reviewed by: carl


# 0692579b 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add structure definitions and controller command function for firmware
log pages.

Sponsored by: Intel
Reviewed by: carl


# 08927782 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add structure definitions and a controller command function for
error log pages.

Sponsored by: Intel
Reviewed by: carl


# f37c22a3 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Make nvme_ctrlr_reset a nop if a reset is already in progress.

This protects against cases where a controller crashes with multiple
I/O outstanding, each timing out and requesting controller resets
simultaneously.

While here, remove a debugging printf from a previous commit, and add
more logging around I/O that need to be resubmitted after a controller
reset.

Sponsored by: Intel
Reviewed by: carl


# 48ce3178 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

By default, always escalate to controller reset when an I/O times out.

While aborts are typically cleaner than a full controller reset, many times
an I/O timeout indicates other controller-level issues where aborts may not
work. NVMe drivers for other operating systems are also defaulting to
controller reset rather than aborts for timed out I/O.

Sponsored by: Intel
Reviewed by: carl


# 94143332 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add a tunable for the I/O timeout interval. Default is still 30 seconds,
but can be adjusted between a min/max of 5 and 120 seconds.

Sponsored by: Intel
Reviewed by: carl
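
The bounds above can be sketched as a simple range check. The names and
macros below are hypothetical, and the message does not say whether
out-of-range values are rejected or clamped; this sketch rejects them and
keeps the current value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds and default taken from the message; macro names are made up. */
#define SKETCH_MIN_TIMEOUT_SEC		5
#define SKETCH_MAX_TIMEOUT_SEC		120
#define SKETCH_DEFAULT_TIMEOUT_SEC	30

/* True if the requested I/O timeout lies within the allowed range. */
static bool
timeout_period_valid(uint32_t seconds)
{
	return (seconds >= SKETCH_MIN_TIMEOUT_SEC &&
	    seconds <= SKETCH_MAX_TIMEOUT_SEC);
}

/* Return the new timeout, keeping the current one if the request is invalid. */
static uint32_t
set_timeout_period(uint32_t current, uint32_t requested)
{
	return (timeout_period_valid(requested) ? requested : current);
}
```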


# 12d191ec 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add handling for controller fatal status (csts.cfs).

On any I/O timeout, check for csts.cfs==1. If set, the controller
is reporting fatal status and we reset the controller immediately,
rather than trying to abort the timed out command.

This changeset also includes deferring the controller start portion
of the reset to a separate task. This ensures we are always performing
a controller start operation from a consistent context.

Sponsored by: Intel
Reviewed by: carl


# b846efd7 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add controller reset capability to nvme(4) and ability to explicitly
invoke it from nvmecontrol(8).

Controller reset will be performed in cases where I/O are repeatedly
timing out, the controller reports an unrecoverable condition, or
when explicitly requested via IOCTL or an nvme consumer. Since the
controller may be in such a state where it cannot even process queue
deletion requests, we will perform a controller reset without trying
to clean up anything on the controller first.

Sponsored by: Intel
Reviewed by: carl


# 65c2474e 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Keep a doubly-linked list of outstanding trackers.

This enables in-order re-submission of I/O after a controller reset.

Sponsored by: Intel


# 99d99f74 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Expose the get/set features API to nvme consumers.

Sponsored by: Intel


# 038a5ee4 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add an interface for nvme shim drivers (i.e. nvd) to register for
notifications when new nvme controllers are added to the system.

Sponsored by: Intel


# 0a0b08cc 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Enable asynchronous event requests on non-Chatham devices.

Also add logic to clean up all outstanding asynchronous event requests
when resetting or shutting down the controller, since these requests
will not be explicitly completed by the controller itself.

Sponsored by: Intel


# 990e741c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Move controller destruction code from nvme_detach() to new nvme_ctrlr_destruct()
function.

Sponsored by: Intel


# 274b3a88 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Specify command timeout interval on a per-command type basis.

This is primarily driven by the need to disable timeouts for asynchronous
event requests, which by nature should not be timed out.

Sponsored by: Intel


# 448195e7 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add support for ABORT commands, including issuing these commands when
an I/O times out.

Also ensure that we retry commands that are aborted due to a timeout.

Sponsored by: Intel


# 91fe20e3 18-Dec-2012 Jim Harris <jimharris@FreeBSD.org>

Map BAR 4/5, because NVMe spec says devices may place the MSI-X table
behind BAR 4/5, rather than in BAR 0/1 with the control/doorbell registers.

Sponsored by: Intel


# 4d6abcb1 18-Dec-2012 Jim Harris <jimharris@FreeBSD.org>

Do not use taskqueue to defer completion work when using INTx. INTx now
matches MSI-X behavior.

Sponsored by: Intel


# 38ce9496 06-Dec-2012 Jim Harris <jimharris@FreeBSD.org>

Add PCI device ID for 8-channel IDT NVMe controller, and clarify that the
previously defined IDT PCI device ID was for a 32-channel controller.

Submitted by: Joe Golio <joseph.golio@isilon.com>


# 0f71ecf7 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add ability to queue nvme_request objects if no nvme_trackers are available.

This eliminates the need to manage queue depth at the nvd(4) level for
Chatham prototype board workarounds, and also adds the ability to
accept a number of requests on a single qpair that is much larger
than the number of trackers allocated.

Sponsored by: Intel


# 21b6da58 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Preallocate a limited number of nvme_tracker objects per qpair, rather
than dynamically creating them at runtime.

Sponsored by: Intel


# 5ae9ed68 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Create nvme_qpair_submit_request() which eliminates all of the code
duplication between the admin and io controller-level submit
functions.

Sponsored by: Intel


# 5fa5cc5f 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Cleanup uio-related code to use struct nvme_request and
nvme_ctrlr_submit_io_request().

While here, also fix case where a uio may have more than 1 iovec.
NVMe's definition of SGEs (called PRPs) only allows for the first SGE to
start on a non-page boundary. The simplest way to handle this is to
construct a temporary uio for each iovec, and submit an NVMe request
for each.

Sponsored by: Intel


# d281e8fb 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add nvme_ctrlr_submit_[admin|io]_request functions which consolidate
code for allocating nvme_tracker objects and making calls into
bus_dmamap_load for commands which have payloads.

Sponsored by: Intel


# ad697276 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add struct nvme_request object which contains all of the parameters passed
from an NVMe consumer.

This allows us to mostly build NVMe command buffers without holding the
qpair lock, and also allows for future queueing of nvme_request objects
in cases where the submission queue is full and no nvme_tracker objects
are available.

Sponsored by: Intel


# f2b19f67 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Merge struct nvme_prp_list into struct nvme_tracker.

This simplifies the driver significantly where it is constructing
commands to be submitted to hardware. By reducing the number of
PRPs (NVMe parlance for SGE) from 128 to 32, it ensures we do not
allocate too much memory for more common smaller I/O sizes, while
still supporting up to 128KB I/O sizes.

This also paves the way for pre-allocation of nvme_tracker objects
for each queue which will simplify the I/O path even further.

Sponsored by: Intel
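
The 128KB figure follows from the PRP count. The arithmetic below assumes
that each PRP entry describes one 4 KiB page and that the buffer is page
aligned; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Largest page-aligned I/O that num_prps PRP entries can describe,
 * assuming one page per entry.
 */
static uint32_t
max_io_bytes(uint32_t num_prps, uint32_t page_size)
{
	return (num_prps * page_size);
}
```

With 32 PRPs and 4096-byte pages this gives 131072 bytes (128KB); the
earlier 128-entry list would have covered 512KB.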


# 6568ebfc 10-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Count number of times each queue pair's interrupt handler is invoked.

Also add sysctls to query and reset each queue pair's stats, including
the new count added here.

Sponsored by: Intel


# 8bed48f2 10-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Put the nvme_qpair mutex on its own cacheline.

Sponsored by: Intel


# bb0ec6b3 17-Sep-2012 Jim Harris <jimharris@FreeBSD.org>

This is the first of several commits which will add NVM Express (NVMe)
support to FreeBSD. A full description of the overall functionality
being added is below. nvmexpress.org defines NVM Express as "an optimized
register interface, command set and feature set for PCI Express (PCIe)-based
Solid-State Drives (SSDs)."

This commit adds nvme(4) and nvd(4) driver source code and Makefiles
to the tree.

Full NVMe functionality description:
Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe)
device support.

There will continue to be ongoing work on NVM Express support, but there
is more than enough to allow for evaluation of pre-production NVM Express
devices as well as soliciting feedback. Questions and feedback are welcome.

nvme(4) implements NVMe hardware abstraction and is a provider of NVMe
namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN.
nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks.
nvmecontrol(8) is used for NVMe configuration and management.

The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue
dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts; see below)
- compilation switches for support back to stable-7

nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling

nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in
human-readable or hex format
- perftest: quick and dirty performance test to measure raw
performance of NVMe device without userspace/physio/GEOM
overhead

The following are still work in progress and will be completed over the
next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval

This has been primarily tested on amd64, with light testing on i386. I
would be happy to provide assistance to anyone interested in porting
this to other architectures, but am not currently planning to do this
work myself. Big-endian and dmamap sync for command/completion queues
are the main areas that would need to be addressed.

The nvme(4) driver currently has references to Chatham, which is an
Intel-developed prototype board which is not fully spec compliant.
These references will all be removed over time.

Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>