History log of /freebsd-current/sys/dev/nvme/nvme_ctrlr.c
Revision Date Author Comments
# da4230af 13-May-2024 John Baldwin <jhb@FreeBSD.org>

nvme/f: Use strlcpy instead of strncpy + manual string termination

Reviewed by: dab, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45153
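
As an illustration of the two idioms named in the commit title, here is a minimal sketch. The helper sketch_strlcpy is a hypothetical stand-in that mirrors the documented strlcpy(3) semantics so the example builds where libc lacks strlcpy; copy_old_style is likewise an illustrative name and is not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in with strlcpy(3) semantics: copy at most
 * dstsize - 1 bytes, always NUL-terminate when dstsize > 0, and
 * return strlen(src) so callers can detect truncation.
 */
static size_t
sketch_strlcpy(char *dst, const char *src, size_t dstsize)
{
    size_t srclen = strlen(src);

    if (dstsize > 0) {
        size_t n = srclen < dstsize - 1 ? srclen : dstsize - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return (srclen);
}

/* Old idiom: strncpy does not terminate on truncation, so do it by hand. */
static void
copy_old_style(char *dst, const char *src, size_t dstsize)
{
    assert(dstsize > 0);
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';
}
```

Both helpers leave the destination terminated; the strlcpy form does it in one call and also reports the source length.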


# 97b77de2 16-Apr-2024 Warner Losh <imp@FreeBSD.org>

nvme: Eliminate intel_log_temp_stats_swapbytes

We can't post an AER for this page, so there's no need to be able to swap
it to host byte order. It's not one of the standard defined pages that
can post via AER, and the vendor's public docs for this temperature page
don't suggest it's possible to get over or under event changes. Since
nvmecontrol no longer needs the swap routine, remove it since it's
now unused.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D44659


# b354bb04 22-Mar-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Add constants for fields in AER completion dword 0

Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44445


# 2a2682ee 06-Mar-2024 Warner Losh <imp@FreeBSD.org>

nvme: Add SMART WARNING for persistent memory region

NVME 2.0 added persistent memory regions, and this bit reports critical
warnings / errors with those regions.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D44213


# 5cdedf67 06-Mar-2024 Warner Losh <imp@FreeBSD.org>

nvme: Log reset success or failure to devd

We're logging when we start a reset, but not when we complete it, nor
the result. Now log a success or timed_out event for the reset.
Currently, the only detectable error we have from reset is 'failure to
become ready in time,' though the code looks like it might be more
generic. Log this and if we ever have other failure modes, change the
logging to devd when that happens.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D44211


# 4f817fcf 06-Mar-2024 Warner Losh <imp@FreeBSD.org>

nvme: Change devctl events for the controller

Change the devctl events slightly for the controller. SMART errors will
log the changed bits in the NVME SMART Critical Warning State as its
event.

Reset will now emit 'event=start'. Soon more.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D44210


# fc3afe93 06-Mar-2024 Warner Losh <imp@FreeBSD.org>

nvme: split devctl out to its own function

Split the devctl aspect of things out to its own function in
nvme_ctrlr_devctl_log. In preparing to document this, and based on
actual use, we want something different for the SMART errors, so this
will facilitate that.

Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D44209


# c5246cb7 01-Mar-2024 Warner Losh <imp@FreeBSD.org>

nvme: Report only the unknown bits

When we get a smart error that's unknown, report only the unknown
(reserved) bits of the Critical Warning Bitfield.

Sponsored by: Netflix
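
A minimal sketch of the masking idea, assuming the Critical Warning bit assignments defined through NVMe 2.0 (bits 0 to 5). The CW_* names and cw_unknown_bits are illustrative and are not the driver's identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Critical Warning bits defined through NVMe 2.0; names are illustrative. */
#define CW_AVAILABLE_SPARE    (1u << 0)
#define CW_TEMPERATURE        (1u << 1)
#define CW_DEVICE_RELIABILITY (1u << 2)
#define CW_READ_ONLY          (1u << 3)
#define CW_VOLATILE_BACKUP    (1u << 4)
#define CW_PERSISTENT_MEMORY  (1u << 5)
#define CW_KNOWN_MASK \
    (CW_AVAILABLE_SPARE | CW_TEMPERATURE | CW_DEVICE_RELIABILITY | \
     CW_READ_ONLY | CW_VOLATILE_BACKUP | CW_PERSISTENT_MEMORY)

/* Keep only the reserved (unknown) bits of the bitfield. */
static uint8_t
cw_unknown_bits(uint8_t critical_warning)
{
    return (critical_warning & (uint8_t)~CW_KNOWN_MASK);
}
```

With this mask, a value of 0x82 (temperature plus a reserved bit) would be reported as 0x80.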


# 7485926e 01-Mar-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Firmware revisions in the firmware slot info logpage are ASCII strings

In particular, don't try to byteswap the values as 64-bit integers and
always print a non-empty version as a string.

Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44121


# 5650bd3f 29-Jan-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Use the NVMEF macro to construct fields

Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43605


# 8488fc41 29-Jan-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Use the NVMEM macro instead of expanded versions

A few of these omitted a shift of 0, but this is more consistent.

Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43602


# 479680f2 29-Jan-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Use the NVMEV macro instead of expanded versions

Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43595


# b46c7b1e 27-Dec-2023 Alexander Motin <mav@FreeBSD.org>

nvme: Add some bits from NVMe 2.0c spec.

MFC after: 1 week


# d9b7301b 18-Dec-2023 Mark Johnston <markj@FreeBSD.org>

nvme: Initialize HMB entries before loading them into the controller

struct nvme_hmb_desc contains a pad field which was not getting
initialized before being synced. This doesn't have much consequence but
triggers a report from KMSAN, which verifies that host-filled DMA memory
is initialized before it is made visible to the device. So, let's just
initialize it properly.

Reported by: KMSAN
Reviewed by: mav, imp
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D43090
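
The pattern described above can be sketched as follows. The struct layout, field names, and sketch_fill_hmb_descs are hypothetical stand-ins chosen for illustration; only the idea (zero the whole descriptor array, pad included, before filling it) comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout with an explicit pad field. */
struct sketch_hmb_desc {
    uint64_t addr;
    uint32_t size;
    uint32_t pad;
};

/*
 * Zero every descriptor first so the pad field is initialized,
 * then fill in the fields the device actually reads.
 */
static void
sketch_fill_hmb_descs(struct sketch_hmb_desc *descs, size_t n,
    uint64_t base, uint32_t chunk_units, uint64_t chunk_bytes)
{
    memset(descs, 0, n * sizeof(*descs));
    for (size_t i = 0; i < n; i++) {
        descs[i].addr = base + i * chunk_bytes;
        descs[i].size = chunk_units;
    }
}
```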


# 34a6ad84 17-Nov-2023 Warner Losh <imp@FreeBSD.org>

nvme: Don't use version to listen for events for ns and fw changes

Instead, use the attribute bits from the identification data to
determine if we should listen to namespace changes and firmware
activation. Should have no functional change, though we may stop
listening for events that will never happen.

Sponsored by: Netflix


# fd9a4a67 06-Nov-2023 Warner Losh <imp@FreeBSD.org>

cam: Minor opt_cam.h cleanup

sys/cam/cam.h includes opt_cam.h, so none of the clients need to do
this. cam.h does all the right dancing to conditionally include
opt_cam.h only when it makes sense. It generally only matters when
cam_debug.h is included (it must be included before that). Many of the
stray opt_cam.h includes were after cam_debug.h which would be a problem
were it not included in cam/cam.h. The other users of CAM options that
aren't debug all already include cam/cam.h.

Also trim unneeded sys/cdefs.h files from the files touched.

Sponsored by: Netflix


# 8d6c0743 06-Nov-2023 Alexander Motin <mav@FreeBSD.org>

nvme: Introduce longer timeouts for admin queue

KIOXIA CD8 SSDs routinely take ~25 seconds to delete a non-empty
namespace. In some cases, like hot-plug, it takes longer, triggering a
timeout and controller reset after just 30 seconds. Linux has had a
separate 60-second timeout for the admin queue for many years. This
patch does the same. And it is good to be consistent.

Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454


# 6b2a6e9c 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove stale comment

After da8324a9258f, the pre/post hooks are gone. So remove a comment
about why we don't call them in this case.

Sponsored by: Netflix
Reviewed by: chuck, jhb
Differential Revision: https://reviews.freebsd.org/D42050


# 40261289 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: Really remove NVME_2X_RESET

da8324a9258f removed one of the two instances of NVME_2X_RESET. It
failed to snag the other one, and remove it from the options file.
Remove from both of those here.

Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42049


# bc85cd30 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: gc nvme_ctrlr_post_failed_request and related task stuff

In 4b977e6dda92 we removed the call to nvme_ctrlr_post_failed_request
because we can now directly fail requests in this context since we're in
the reset task already. No need to queue it. I left it in place against
future need, but it's been two years and no panics have resulted. Since
the static analysis (code checking) and the dynamic analysis (surviving
in the field for 2 years, including at $WORK where we know we've gone
through this path when we've failed drives) both signal that it's not
really needed, go ahead and GC it. If we discover at a later date a flaw
in this analysis, we can add it back easily enough by reverting this and
4b977e6dda92.

Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42048


# 7ea866eb 07-Sep-2023 David Sloan <david.sloan@eideticom.com>

nvme: Fix memory leak in pt ioctl commands

When running nvme passthrough commands through the ioctl interface
memory is mapped with vmapbuf() but not unmapped. This results in leaked
memory whenever a process executes an nvme passthrough command with a
data buffer. This can be replicated with a simple c function (error
checks skipped for brevity):

void leak_memory(int nvme_ns_fd, uint16_t nblocks) {
        struct nvme_pt_command pt = {
                .cmd = {
                        .opc = NVME_OPC_READ,
                        .cdw12 = nblocks - 1,
                },
                .len = nblocks * 512, // Assumes devices with 512 byte lba
                .is_read = 1, // Reads and writes should both trigger leak
        };
        void *buf;

        // Alignment of 4096 is an assumption; the original call omitted it
        posix_memalign(&buf, 4096, nblocks * 512);
        pt.buf = buf;
        ioctl(nvme_ns_fd, NVME_PASSTHROUGH_COMMAND, &pt);
        free(buf);
}

Signed-off-by: David Sloan <david.sloan@eideticom.com>

PR: 273626
Reviewed by: imp, markj
MFC after: 1 week


# da8324a9 24-Sep-2023 Warner Losh <imp@FreeBSD.org>

nvme: Fix locking protocol violation to fix suspend / resume

Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.

Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.

Also, tidy up the failure case as well: failing a queue is a superset of
disabling it, so no need to call disable first. This solves some
locking issues with recursion since we don't need to recurse. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.

Next, kill the pre/post reset stuff. It's completely bogus since we
disable the qpairs, we don't need to also hold the lock through the
reset: disabling will cause the ISR to return early. This keeps us from
recursing on the recovery lock when resuming. We only need the recovery
lock to avoid a specific race between the timer and the ISR.

Finally, kill NVME_2X_RESET. It's been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.

These are all interrelated, so need to be done at the same time.

Sponsored by: Netflix
Reviewed by: jhb
Tested by: jhb (made sure suspend / resume worked)
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D41866


# 8052b01e 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add exclusion for ISR

Add a basically uncontended spinlock that we take out while the ISR is
running. This has two effects: First, when we get a timeout, we can
safely call the nvme_qpair_process_completions w/o racing any ISRs.
Second, we can use it to ensure that we don't reset the card while
the ISRs are active (right now we just sleep and hope for the best,
which usually is fine, but not always).

Sponsored by: Netflix
MFC After: 2 weeks
Reviewed by: chuck, gallatin
Differential Revision: https://reviews.freebsd.org/D41452


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 6e8ab671 04-Jun-2022 Gordon Bergling <gbe@FreeBSD.org>

nvme(4): Fix a typo in a source code comment

- s/inaccessable/inaccessible/

MFC after: 3 days


# 3740a8db 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Further refinements in Host Memory Buffer Sizing

Host Memory Buffer units are a mix. For those in the identify structure,
the size is in 4kiB chunks. For specifying the buffer description,
though, they are in terms of the drive's MPS. Add comments to this
effect and change PAGE_SIZE to ctrlr->page_size where needed, as well as
correct a mistaken use of NVME_HMB_UNITS in 214df80a9cb3 as pointed out
by rpokala@ after the commit. No functional change is intended, as
page_size is still 4k which matches all current hosts' PAGE_SIZE, but to
support 16k pages on arm, we need to differentiate these two cases.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34871
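
To make the two unit systems concrete, here is a small sketch of the conversions the message describes. SKETCH_HMB_UNITS and both function names are illustrative, not the driver's own; the only facts taken from the text are that identify-structure sizes are in 4 KiB chunks and descriptor sizes are in units of the drive's memory page size.

```c
#include <assert.h>
#include <stdint.h>

/* Identify-structure HMB sizes are expressed in 4 KiB chunks. */
#define SKETCH_HMB_UNITS 4096u

/* Convert a size from the identify structure into bytes. */
static uint64_t
sketch_hmb_ident_to_bytes(uint32_t ident_units)
{
    return ((uint64_t)ident_units * SKETCH_HMB_UNITS);
}

/* Convert a byte count into descriptor units of the drive's page size. */
static uint64_t
sketch_hmb_bytes_to_desc_units(uint64_t bytes, uint32_t ctrlr_page_size)
{
    assert(ctrlr_page_size != 0);
    return (bytes / ctrlr_page_size);
}
```

With a 4 KiB controller page size the two unit systems coincide, which is why the change is not expected to alter behavior today; with a 16 KiB page size the same 1 MiB buffer would be 64 descriptor units rather than 256.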


# 3086efe8 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Remove NVME_MAX_XFER_SIZE, replace inline calculation

NVME_MAX_XFER_SIZE used to be a constant (back when MAXPHYS was a
constant) to denote the smaller of MAXPHYS or the largest PRP we could
encode with our preallocation scheme. However, it's no longer constant
since MAXPHYS varies at runtime. In addition, the actual maximum is now
based on the drive's currently in use page_size, which is also a runtime
expression. As such, remove the define and expand it inline in the one
place it's still used in the tree.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34870


# 3a468f20 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Use saved mps when initializing drive

Make sure we set the MPS we cached (currently the drive's minimum mps) in
CC (Controller Configuration) when reinitializing the drive. It must
match the page_size that we're going to use. Also retire less specific
NVME_PAGE_SHIFT since it's now unused.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34869


# 55412ef9 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Rename min_page_size to page_size and save mps

The Memory Page Size sets the basic unit of operation for the drive. We
currently set this to the drive's minimum page size, but we could set it
to any page size the drive supports in the future. Replace min_page_size
(it's now unused for that purpose) with page_size to reflect this and
cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34868


# 6e3deec8 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Base maximum data transfer size directly on MPSMIN in cap_hi

Calculate the maximum transfer size based on the MPSMIN we have in our
cached copy of cap_hi rather than using min_page_size in the controller.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D34867


# 214df80a 08-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: new define for size of host memory buffer sizes

The nvme spec defines the various fields that specify sizes for host
memory buffers in terms of 4096 chunks. So, rather than use a bare 4096
here, use NVME_HMB_UNITS. This is explicitly not the host page size of
4096, nor the default memory page size (mps) of the NVMe drive, but its
own thing and needs its own define.

No functional change is intended, only the logical spelling of 4k.

Sponsored by: Netflix


# 6af6a52e 29-Mar-2022 Warner Losh <imp@FreeBSD.org>

nvme: Save cap_lo and cap_hi

Save the capabilities for the drive.

Sponsored by: Netflix


# a70b5660 29-Mar-2022 Warner Losh <imp@FreeBSD.org>

nvme: MPS is a power of two, not a size / 8k

Setting MPS in the CC should be a power of 2 number (it specifies the
page size of the host is 2^(12+MPS)), so adjust the calculation. There is
no functional change because we do not support any architectures != 4k
pages (yet). Other changes are needed for architectures with 16k or 64k
pages, especially when the underlying NVMe drive doesn't support that
page size (Most drives support a range that's small, and many only
support 4k), but let's at least do this calculation correctly. 12 - 12
is just as much 0 as 4096 >> 13 is :)

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D34707
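
The arithmetic can be checked with a short sketch. Both function names are illustrative; the first computes MPS as log2(page size) - 12, the second reproduces the earlier "size / 8k" form purely for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* CC.MPS encodes the page size as 2^(12 + MPS). */
static uint32_t
sketch_mps_from_page_size(uint32_t page_size)
{
    uint32_t shift = 0;

    /* Page size must be a power of two of at least 4 KiB. */
    assert(page_size >= 4096 && (page_size & (page_size - 1)) == 0);
    while ((1u << shift) < page_size)
        shift++;
    return (shift - 12);
}

/* The earlier "size / 8k" form, kept only for comparison. */
static uint32_t
sketch_mps_old_style(uint32_t page_size)
{
    return (page_size >> 13);
}
```

For 4k pages both forms give 0, matching the "no functional change" remark; for 64k pages they diverge (4 versus 8), which is the case the corrected calculation is meant to handle.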


# 83581511 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use adaptive spinning when polling for completion or state change

We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These commands take about 20-200 microseconds
each. Set the wait time to 1us and then increase it by a factor of 1.5
each successive iteration (max 1ms). This reduces initialization time by
80ms in cperciva's tests.

Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.

Tested by: cperciva
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32259
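
A minimal sketch of the backoff schedule, assuming "increase by 1.5" means multiplying the delay by 1.5 on each iteration. The delay is tracked in nanoseconds so that integer math does not stall at 1; the names and the nanosecond unit are choices made for this example, not the driver's.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DELAY_START_NS 1000u     /* 1us initial wait */
#define SKETCH_DELAY_MAX_NS   1000000u  /* 1ms ceiling */

/* Grow the delay by a factor of 1.5, clamped to the ceiling. */
static uint32_t
sketch_next_delay_ns(uint32_t delay_ns)
{
    uint64_t next = (uint64_t)delay_ns * 3 / 2;

    return (next > SKETCH_DELAY_MAX_NS ?
        SKETCH_DELAY_MAX_NS : (uint32_t)next);
}

/* Total time waited after a given number of polls from the start value. */
static uint64_t
sketch_total_wait_ns(unsigned polls)
{
    uint64_t total = 0;
    uint32_t delay = SKETCH_DELAY_START_NS;

    for (unsigned i = 0; i < polls; i++) {
        total += delay;
        delay = sketch_next_delay_ns(delay);
    }
    return (total);
}
```

Short waits dominate early on, so commands that finish in tens of microseconds are noticed quickly, while the 1ms ceiling bounds the polling cost for slow ones.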


# 4b3da659 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Only reset once on attach.

The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical issue assuring the hardware is in a known
state. However, experience has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation when you need a second reset (for buggy or at least somewhat
out of the mainstream hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).

If there's any trouble at all with this, I'll add a sysctl tunable to
control it.

Sponsored by: Netflix
Reviewed by: cperciva, mav
Differential Revision: https://reviews.freebsd.org/D32241


# e5e26e4a 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Remove pause while resetting

After some study of the code and the standard, I think we can just drop
the pause(), unconditionally. If we're not initialized, then there's
nothing to wait for from a software perspective. If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.

If we go on to fail the controller, we'll cancel the outstanding I/O
transactions. If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happened to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).

The standard imposes no wait times here, so it isn't needed from that
perspective.

Unanswered Question: Do we need to disable interrupts while we
disable in legacy mode, since those are level-sensitive?

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32248


# 77054a89 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Explain a workaround a little better

The "don't touch the mmio of the drive after we do an EN 1->0 transition"
workaround is only for a tiny number of drives that have this unfortunate issue.

Sponsored by: Netflix


# a245627a 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme_ctrlr_enable: Small style nits

Rewrite the nested if's using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit new
code layout.

Sponsored by: Netflix
Reviewed by: mav, chuck (prior rev, but comments rolled in)
Differential Revision: https://reviews.freebsd.org/D32245


# 26259f6a 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use MS_2_TICKS rather than rolling our own

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32246


# d5fca1dc 01-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme_ctrlr_enable: Remove unnecessary 5ms delays

Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There's three lines of supporting
evidence.

First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.

To be safe, add a write barrier between setting up the admin queue and
enabling the controller.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32247


# 502dc84a 23-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use shared timeout rather than timeout per transaction

Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have entered timeout. If so, quiesce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28583


# bad42df9 05-Sep-2021 Colin Percival <cperciva@FreeBSD.org>

Add some nvme initialization routines to TSLOG

About 335 ms of EC2 instance boot time is being spent here.


# e3bdf3da 31-Aug-2021 Alexander Motin <mav@FreeBSD.org>

nvme(4): Add MSI and single MSI-X support.

If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after: 1 week


# 31111372 30-Aug-2021 Alexander Motin <mav@FreeBSD.org>

nvme(4): Do not panic on admin queue construct error.

MFC after: 1 week


# f0f47121 28-May-2021 Warner Losh <imp@FreeBSD.org>

nvme: fix a race between failing the controller and failing requests

Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.

This is a temporary kludge until I can work out questions that arose
during the review, not least is what was the race that queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing changing that to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.

Multiple threads, though, solve the problem in the mean time until other
such means can be perfected.

Reviewed by: jhb@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30366


# 4fbbe523 17-Mar-2021 Alexander Motin <mav@FreeBSD.org>

nvme: Replace potentially long DELAY() with pause().

In some cases, like broken hardware, nvme(4) may wait minutes for a
controller response before timing out. Doing so in a tight spin loop
made the whole system unresponsive.

Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29309
Sponsored by: iXsystems, Inc.


# 8423f5d4 11-Mar-2021 Warner Losh <imp@FreeBSD.org>

nvme: use config_intrhook_drain to avoid removable card races

nvme drives are configured early in boot. However, a number of the configuration
steps take a while, so we defer those to a config intrhook that runs
before the root filesystem is mounted. At the same time, the PCI hot plug wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.

Sponsored by: Netflix, Inc
Reviewed by: jhb, mav
Differential Revision: https://reviews.freebsd.org/D29006


# dd2516fc 08-Feb-2021 Warner Losh <imp@FreeBSD.org>

nvme: Make nvme_ctrlr_hw_reset static

nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so
make it static. If we need to change this in the future we can.


# 9600aa31 08-Feb-2021 Warner Losh <imp@FreeBSD.org>

nvme: use NVME_GONE rather than hard-coded 0xffffffff

Make it clearer that the value 0xffffffff is being used to detect the device is
gone. We use it other places in the driver for other meanings.


# 1770bae5 28-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Remove alignment requirements for passthrough buffer.

After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers
thanks to extra page added to pbuf_zone.


# ac90f70d 28-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Increase nvme(4) maximum transfer size from 1MB to 2MB.

With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs. The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
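
The 2MB figure follows from the size of a PRP entry. The sketch below reproduces the arithmetic as the message states it, relying on the NVMe definition of a PRP entry as an 8-byte address; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One page of PRP entries holds page_size / 8 entries, and each entry
 * points at one page, so the addressable span is their product.
 */
static uint64_t
sketch_max_xfer_one_prp_page(uint64_t page_size)
{
    uint64_t entries = page_size / sizeof(uint64_t);

    return (entries * page_size);
}

/* The admin queue case from the text: MIN(maxphys, 2MB). */
static uint64_t
sketch_admin_max_xfer(uint64_t maxphys, uint64_t page_size)
{
    uint64_t prp_limit = sketch_max_xfer_one_prp_page(page_size);

    return (maxphys < prp_limit ? maxphys : prp_limit);
}
```

With 4KB pages this gives 512 entries times 4KB, or 2MB, so going further would need a chained PRP list.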


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get private MAXPHYS-like constant; their conversion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
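
The reason for the extra page in the pbuf case can be seen by counting the pages a buffer touches. This sketch uses an illustrative helper, sketch_pages_spanned, with plain integer addresses; it is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages touched by [addr, addr + len) for a given page size. */
static uint64_t
sketch_pages_spanned(uint64_t addr, uint64_t len, uint64_t page_size)
{
    assert(len > 0 && page_size > 0);

    uint64_t first = addr / page_size;
    uint64_t last = (addr + len - 1) / page_size;

    return (last - first + 1);
}
```

A page-aligned 1M buffer with 4K pages touches 256 pages, while the same buffer starting 512 bytes into a page touches 257, hence atop(maxphys) + 1.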


# 0bed3eab 13-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Add PMRCAP printing and fix earlier CAP_HI.

MFC after: 3 days


# 46fbd800 12-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Fix panic if NVMe is detached before the intrhook call.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# c44441f8 28-Oct-2020 Alexander Motin <mav@FreeBSD.org>

Print NVMe controller capabilities in verbose dmesg.

Those values are not reported in controller identification, while sometimes
interesting for development and debugging.

MFC after: 1 week


# 44ca4575 21-Oct-2020 Brooks Davis <brooks@FreeBSD.org>

vmapbuf: don't smuggle address or length in buf

Instead, add arguments to vmapbuf. Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by: jhb
Reviewed by: imp, jhb
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26784


# 915f0197 14-Oct-2020 Alexander Motin <mav@FreeBSD.org>

Use RTD3 Entry Latency value as shutdown timeout.

This field was not in specs when the driver was written, but now there
are SSDs with the reported latency of 10s, where hardcoded value of 5s
seems to be not enough sometimes, causing shutdown timeout messages.

MFC after: 1 week
Sponsored by: iXsystems, Inc.
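
A minimal sketch of the timeout selection, under two assumptions: the RTD3 Entry Latency is reported in microseconds, and zero means the controller does not report it. The 5s fallback is the previously hardcoded value mentioned in the message; the function name and millisecond return unit are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEFAULT_SHUTDOWN_MS 5000u  /* previous hardcoded 5s */

/* Pick a shutdown timeout in milliseconds from the reported latency. */
static uint32_t
sketch_shutdown_timeout_ms(uint32_t rtd3e_us)
{
    if (rtd3e_us == 0)
        return (SKETCH_DEFAULT_SHUTDOWN_MS);
    /* Round up so a short latency never becomes a zero timeout. */
    return ((uint32_t)(((uint64_t)rtd3e_us + 999) / 1000));
}
```

An SSD reporting 10s would then get a 10000ms timeout instead of timing out at 5s.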


# e32d47f3 21-Sep-2020 David Bright <dab@FreeBSD.org>

Add an ioctl to get an NVMe device's maximum transfer size

Reviewed by: imp, chuck
Obtained from: Dell EMC Isilon
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26390


# d87b31e1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

nvme: clean up empty lines in .c and .h files


# 881534f0 31-Aug-2020 Warner Losh <imp@FreeBSD.org>

Use symbolic names for async events

Rather than |= 0x300, define and use async event names for the name
space changes and the firmware activations that we're asking for.
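
The change can be pictured with a short sketch. The macro and function names here are illustrative, not the driver's actual identifiers; the only fact relied on is that 0x300 is bits 8 and 9, which the message ties to namespace changes and firmware activation.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names for the two bits that make up 0x300. */
#define SKETCH_ASYNC_EVENT_NS_ATTRIBUTE  (1u << 8)
#define SKETCH_ASYNC_EVENT_FW_ACTIVATE   (1u << 9)

/* Request both notifications on top of an existing configuration. */
static uint32_t
sketch_enable_ns_and_fw_events(uint32_t config)
{
    return (config | SKETCH_ASYNC_EVENT_NS_ATTRIBUTE |
        SKETCH_ASYNC_EVENT_FW_ACTIVATE);
}
```

The resulting value is identical to the bare constant; only the readability of the call site changes.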


# 701267ad 25-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Fix few panics on NVMe's timing out initialization requests.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# ead7e103 18-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Make polled request timeout less invasive.

Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.

In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 550d5d64 17-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Fix admin qpair leak if detached during initial reset.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 92390644 12-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Fix config_intrhook leak on initial reset failure.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 4053f8ac 02-May-2020 David Bright <dab@FreeBSD.org>

Fix various Coverity-detected errors in nvme driver

This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by: imp@, vangyzen@
MFC after: 5 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24532


# 4e6a434b 29-Apr-2020 Warner Losh <imp@FreeBSD.org>

Make sure that we get the sbuf resources we need.

Since we're calling sbuf_new with NOWAIT, make sure it can allocate a
buffer to use. Don't print anything if we can't get it.

Noticed by: rpokala


# 244b8053 29-Apr-2020 Warner Losh <imp@FreeBSD.org>

Generate a devctl event for interesting events

When we reset the controller, and when the controller tells us about a
critical warning, send an event.


# b2cdfb72 08-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Fix copy-paste bug in HMB free code.

MFC after: 2 weeks
X-MFC-with: r356474


# 6de4e458 07-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Minor adjustments to r356474 and r356480.

Reported by: jkim, imp
MFC after: 2 weeks
X-MFC-with: r356474


# 1c7dd40e 07-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Increase HMB limit from 1% to 5%.

SSD capacity in laptops is growing faster than RAM size, so my original
guess seems too low on second thought. Hopefully nobody will build large
array of those crappy SSDs.

MFC after: 2 weeks
X-MFC-with: 356474


# 67abaee9 07-Jan-2020 Alexander Motin <mav@FreeBSD.org>

Add Host Memory Buffer support to nvme(4).

This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about
1MB per 1GB on the devices I have) for its metadata cache, significantly
improving random I/O performance. Device reports minimal and preferable
size of the buffer. The code limits it to 1% of physical RAM by default.
If the buffer can not be allocated or below minimal size, the device will
just have to work without it.

MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.


# 7588c6cc 13-Dec-2019 Warner Losh <imp@FreeBSD.org>

Move to using bool instead of boolean_t

While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999


# 66e59850 11-Dec-2019 Warner Losh <imp@FreeBSD.org>

Move reset to the interrupt processing stage

This trims the boot time a bit more for AWS and other platforms that have nvme
drives. There's no reason to do this inline. This has been in my tree a while,
but IIRC I talked to Jim Harris about this at one of our face to face meetings.

MFC After: 2 weeks


# 1eab19cb 23-Sep-2019 Alexander Motin <mav@FreeBSD.org>

Make nvme(4) driver some more NUMA aware.

- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestion.
- Remove fixed multiplier of CPUs per queue pair, spread them evenly.
This allows using more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after: 1 month
Sponsored by: iXsystems, Inc.


# f93b7f95 04-Sep-2019 Warner Losh <imp@FreeBSD.org>

Support doorbell strides != 0.

The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
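
A minimal sketch of the pre-computation described above, assuming the
doorbell layout from the NVMe specification (doorbells start at offset 1000h
and are spaced 4 << DSTRD bytes apart). The struct and function names are
hypothetical and are not taken from the driver.

```c
#include <stdint.h>

#define	EX_DOORBELL_BASE	0x1000u	/* first doorbell register offset */

/* Hypothetical per-queue-pair cache of doorbell register offsets. */
struct ex_qpair {
	uint32_t sq_tdbl_off;	/* submission queue tail doorbell */
	uint32_t cq_hdbl_off;	/* completion queue head doorbell */
};

/*
 * Compute both offsets once at queue setup so the I/O path only has to
 * load them. dstrd is the raw doorbell stride field: 0 means 4-byte
 * spacing, 1 means 8-byte spacing.
 */
static void
ex_qpair_init_doorbells(struct ex_qpair *q, uint32_t qid, uint32_t dstrd)
{
	uint32_t stride = UINT32_C(4) << dstrd;

	q->sq_tdbl_off = EX_DOORBELL_BASE + (2 * qid) * stride;
	q->cq_hdbl_off = EX_DOORBELL_BASE + (2 * qid + 1) * stride;
}
```

With a stride of 0 this reproduces the fixed 4-byte layout, so hardware that
follows the expected value is unaffected.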

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514


# 4d547561 03-Sep-2019 Warner Losh <imp@FreeBSD.org>

Implement nvme suspend / resume for pci attachment

When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutdown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.

On resume, we have to reset the card twice, for reasons described in
the attach function. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.

Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.

Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.

Split parts of the hardware reset function up so that I can
do part of the reset in suspend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.

Finally, fix a couple of spelling errors in comments related to
this.

Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493


# ab0681aa 02-Sep-2019 Warner Losh <imp@FreeBSD.org>

In all the places that we use the polled for completion interface, except crash
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler chooses to not inline, any performance
hit is tiny.


# 8e61280b 22-Aug-2019 Warner Losh <imp@FreeBSD.org>

When we have errors resetting the device before we allocate the
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.

This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.


# f182f928 21-Aug-2019 Warner Losh <imp@FreeBSD.org>

Separate the pci attachment from the rest of nvme

Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.

Submitted by: cognet@


# 71a28181 21-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Improve NVMe hot unplug handling.

If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it triggers a controller reset, do not wait for the impossible;
report failure quickly.
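
The check described above might look roughly like the following sketch. The
names are hypothetical, and the samples array merely stands in for
successive reads of the CSTS register so that the logic can be exercised
outside the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define	EX_CSTS_GONE	0xffffffffu	/* all ones: device is unplugged */
#define	EX_CSTS_RDY	0x1u		/* ready bit */

enum ex_wait_result { EX_READY, EX_TIMED_OUT, EX_DEVICE_GONE };

/*
 * Poll for the ready bit, but give up immediately if a read returns all
 * ones, since an unplugged device can never become ready.
 */
static enum ex_wait_result
ex_wait_for_ready(const uint32_t *samples, size_t nsamples)
{
	size_t i;

	for (i = 0; i < nsamples; i++) {
		if (samples[i] == EX_CSTS_GONE)
			return (EX_DEVICE_GONE);
		if ((samples[i] & EX_CSTS_RDY) != 0)
			return (EX_READY);
	}
	return (EX_TIMED_OUT);
}
```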

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# a6d222eb 02-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Add more random bits from NVMe 1.4.

MFC after: 2 weeks


# 6c99d132 02-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Decode few more NVMe log pages.

In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.

Add few new arguments to `nvmecontrol log` subcommand.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# a7bf63be 01-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Add IOCTL to translate nvdX into nvmeY and NSID.

While very useful by itself, it also makes `nvmecontrol` not depend on
hardcoded device names parsing, which in turn makes it simple to take
nvdX (and potentially any other) device names as arguments.

Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them
interchangeable for management purposes.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 08a607e0 25-Jul-2019 Warner Losh <imp@FreeBSD.org>

Widen the type for to.

The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
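
The overflow is easy to reproduce in isolation. The sketch below uses
hypothetical helper names and compares the old 8-bit arithmetic with the
widened version:

```c
#include <stdint.h>

/* Old behaviour: the 8-bit result wraps, so 0xff + 1 becomes 0. */
static uint8_t
ex_timeout_narrow(uint32_t cap_to)
{
	uint8_t to = (uint8_t)cap_to;

	to = (uint8_t)(to + 1);
	return (to);
}

/* New behaviour: keep the 8-bit field in a 32-bit variable before adding. */
static uint32_t
ex_timeout_wide(uint32_t cap_to)
{
	uint32_t to = cap_to & 0xff;

	return (to + 1);
}
```

A device reporting 0xff would have ended up with a timeout of zero under the
narrow type, while the wide version yields 256.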

Reported by: bapt@


# 62d2cf18 18-Jul-2019 Warner Losh <imp@FreeBSD.org>

Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers.

These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is a style
change only, no functional changes.

Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979


# dc9df3a5 16-Jul-2019 Warner Losh <imp@FreeBSD.org>

Assume that the timeout value from the capacity is 1-based

Neither the 1.3 nor the 1.4 standard says this number is 1's based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.


# 9835d216 08-May-2019 Warner Losh <imp@FreeBSD.org>

rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs

Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to
match init/uninit scenarios.

Signed-off-by: John Meneghini <johnm@netapp.com>
Submitted by: Michael Hordijk <hordijk@netapp.com>
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D19781


# 2ffd6fce 08-Mar-2019 Warner Losh <imp@FreeBSD.org>

Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431


# 45d7e233 27-Feb-2019 Warner Losh <imp@FreeBSD.org>

Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.


# 756a5412 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.

o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.

Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.

The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with: gallatin
Tested by: pho


# 91182bcf 07-Dec-2018 Warner Losh <imp@FreeBSD.org>

Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli
(and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3.

Sponsored by: Netflix


# 9544e6dc 21-Aug-2018 Chuck Tuffli <chuck@FreeBSD.org>

Make NVMe compatible with the original API

The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
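
To illustrate why byte-sized members are endian-neutral, here is a small
sketch of the first command dword. The layout (opcode in bits 7:0, fused
bits above it, command identifier in bits 31:16) follows the NVMe
specification; the struct and helper names are hypothetical and the real
structure in the driver is larger.

```c
#include <stdint.h>

/* Hypothetical stand-in for the first dword of a command. */
struct ex_cmd_dw0 {
	uint8_t opc;	/* opcode */
	uint8_t fuse;	/* fused operation bits */
	uint8_t cid[2];	/* command identifier, stored little endian */
};

static void
ex_cmd_dw0_fill(struct ex_cmd_dw0 *c, uint8_t opc, uint8_t fuse, uint16_t cid)
{
	c->opc = opc;
	c->fuse = fuse;
	c->cid[0] = (uint8_t)(cid & 0xff);
	c->cid[1] = (uint8_t)(cid >> 8);
}

/* Reassemble the dword as the controller sees it, on any host CPU. */
static uint32_t
ex_cmd_dw0_as_le32(const struct ex_cmd_dw0 *c)
{
	const uint8_t *p = (const uint8_t *)c;

	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}
```

Because every member is addressed as a byte, the in-memory image is the
same whether the host is big or little endian.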

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404


# f439e3a4 24-May-2018 Alexander Motin <mav@FreeBSD.org>

Refactor NVMe CAM integration.

- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.

Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.

Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.


# c252f637 02-May-2018 Alexander Motin <mav@FreeBSD.org>

Fix LOR between controller and queue locks.

Admin pass-through requests took controller lock before the queue lock,
but in case of request submission to a failed controller controller lock
was taken after the queue lock. Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.

Sponsored by: iXsystems, Inc.


# e134ecdc 30-Apr-2018 Alexander Motin <mav@FreeBSD.org>

Improve nvme(4) attach/detach sequences.

This change allows clean device detach on attach failures and driver unload,
while previous code tried to talk to already shut down controller, or even
accessed resources failed to allocate.

Sponsored by: iXsystems, Inc.


# 5d7fd8f7 14-Mar-2018 Warner Losh <imp@FreeBSD.org>

Fix error messages in cut and pasted code.

Also, fix an unnecessary deref to get ctrlr.

Noticed by: rpokala@
Sponsored by: Netflix


# 8b1e6ebe 14-Mar-2018 Warner Losh <imp@FreeBSD.org>

When tearing down a queue pair, also delete the queue entries.

The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the submission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.

Sponsored by: Netflix


# 0d787e9b 22-Feb-2018 Wojciech Macek <wma@FreeBSD.org>

NVMe: Add big-endian support

Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916


# 29077eb4 28-Jan-2018 Warner Losh <imp@FreeBSD.org>

Use atomic load and stores to ensure that the compiler doesn't
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensures the prior writes to status are ordered before it. It also fixes the
fact that there wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.
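
The release/acquire pairing can be sketched in portable C11. Note that this
uses <stdatomic.h> as a stand-in for the kernel's atomic(9) routines, and
the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical completion record shared by a handler and a polling loop. */
struct ex_completion_status {
	uint16_t	status;	/* payload, written before done is set */
	atomic_int	done;	/* int rather than boolean, as in the commit */
};

static void
ex_status_init(struct ex_completion_status *s)
{
	s->status = 0;
	atomic_init(&s->done, 0);
}

/* Producer: the release store orders the status write before done. */
static void
ex_status_complete(struct ex_completion_status *s, uint16_t status)
{
	s->status = status;
	atomic_store_explicit(&s->done, 1, memory_order_release);
}

/* Consumer: the acquire load cannot be hoisted out of a polling loop. */
static int
ex_status_is_done(struct ex_completion_status *s)
{
	return (atomic_load_explicit(&s->done, memory_order_acquire));
}
```

A polling loop would call ex_status_is_done() repeatedly and read status
only after it returns nonzero.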

Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053


# 989c7f0b 18-Dec-2017 Warner Losh <imp@FreeBSD.org>

Although we only have one quirk at the moment, guard against the day
we have more than one by checking the actual quirk bit before delaying
the reset.

Noticed by: rpokala@


# ce1ec9c1 18-Dec-2017 Warner Losh <imp@FreeBSD.org>

When we're disabling the nvme device, some drives have a controller
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Since this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.

Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.
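
One way to express the rule above is as a small decision function. This is
a sketch under the assumption that EN may only be toggled while RDY equals
the current EN value; the enum and function names are hypothetical and do
not mirror the driver's control flow.

```c
#include <stdbool.h>

enum ex_en_action {
	EX_DONE,		/* EN == RDY == desired state: short circuit */
	EX_WAIT_RDY,		/* EN already as desired, RDY still catching up */
	EX_SETTLE_THEN_TOGGLE,	/* wait for RDY == EN before touching EN */
	EX_TOGGLE		/* safe to flip EN now */
};

static enum ex_en_action
ex_plan_en_transition(bool en, bool rdy, bool want_en)
{
	if (en == want_en)
		return (rdy == want_en ? EX_DONE : EX_WAIT_RDY);
	/* EN must change; flipping it while RDY != EN is undefined. */
	if (rdy != en)
		return (EX_SETTLE_THEN_TOGGLE);
	return (EX_TOGGLE);
}
```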

Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389


# 718cf2cc 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/dev: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# bb1c7be4 15-Oct-2017 Warner Losh <imp@FreeBSD.org>

Create general polling function for the nvme controller. Use it when
we're doing the various pin-based interrupt modes. Adjust
nvme_ctrlr_intx_handler to use nvme_ctrlr_poll.

Sponsored by: Netflix
Suggested by: scottl@


# 5fff95cc 20-Sep-2017 Warner Losh <imp@FreeBSD.org>

Fix queue depth for nda.

1/4 of the number of queues times queue entries is too limiting. It
works up to about 4k IOPS / 3.0GB/s for hardware that can do
4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues
in the fairness of nda's choice of TRIM vs READ. That will be fixed
separately.
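
The arithmetic behind the two limits is simple enough to show directly. The
helper below is a hypothetical sketch, with the fraction passed in so that
both the earlier 1/4 and the new 3/4 can be compared; the queue and entry
counts in the usage note are made-up numbers.

```c
#include <stdint.h>

/* Scale queues * entries by numer/denom using a 64-bit intermediate. */
static uint32_t
ex_max_transactions(uint32_t num_queues, uint32_t entries_per_queue,
    uint32_t numer, uint32_t denom)
{
	uint64_t total = (uint64_t)num_queues * entries_per_queue;

	return ((uint32_t)(total * numer / denom));
}
```

For example, with 8 queues of 128 entries each, 1/4 allows 256 outstanding
transactions and 3/4 allows 768.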


# c02565f9 28-Aug-2017 Warner Losh <imp@FreeBSD.org>

Set the max transactions for NVMe drives better.

Provided a better estimate for the number of transactions that can be
pending at one time. This will be number of queues * number of
trackers / 4, as suggested by Jim Harris. This gives a better estimate
of the number of transactions that CAM should queue before applying
back pressure. This should be revisited when we have real multi-queue
support in CAM and the upper layers of the I/O stack.

Sponsored by: Netflix


# 696c9502 25-Aug-2017 Warner Losh <imp@FreeBSD.org>

NVME Namespace ID is 32-bits, so widen interface to reflect that.

Sponsored by: Netflix


# 824073fb 07-Mar-2017 Warner Losh <imp@FreeBSD.org>

Avoid dereferencing uninitialized elements in the error path.

Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninitialized data.

Sponsored by: Netflix


# a8a18dd5 07-Mar-2017 Warner Losh <imp@FreeBSD.org>

Make multi-namespace nvme drives more robust.

Fix assumptions about name spaces in NVME driver. First, it assumes
cdata.nn is the number of configured devices. However, it is the
number of supported name spaces. Second, it assumes that there will
never be more than 16 name spaces supported, but a certain drive I'm
testing reports 1024. It assumes that name spaces are a tightly packed
namespace, but the standard seems to indicate otherwise. Finally, it
assumes that an error would be generated when querying an
unconfigured namespace. Instead, it succeeds but the identify data is
all zeros.

Fix these by limiting the number of name spaces we probe to 16. Remove
aborting when we find one in error. When the size of the name space is
zero, ignore it.
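
The probing policy described above can be sketched as follows. The names
are hypothetical, and the array stands in for the identify data returned
for each name space ID; the caller is assumed to supply at least
min(nn, 16) entries.

```c
#include <stddef.h>
#include <stdint.h>

#define	EX_MAX_NAMESPACES	16	/* probe limit from the commit text */

/* Hypothetical subset of identify data: just the name space size. */
struct ex_ns_data {
	uint64_t nsze;
};

/*
 * Count usable name spaces: probe at most 16, do not abort on a bad
 * entry, and skip any whose size is zero (an unconfigured name space
 * returns all-zero identify data).
 */
static size_t
ex_count_usable_namespaces(const struct ex_ns_data *ns, uint32_t nn)
{
	uint32_t i, limit;
	size_t usable = 0;

	limit = nn < EX_MAX_NAMESPACES ? nn : EX_MAX_NAMESPACES;
	for (i = 0; i < limit; i++) {
		if (ns[i].nsze == 0)
			continue;
		usable++;
	}
	return (usable);
}
```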

This is admittedly a band-aid. The long term fix will be to
participate in the enumeration and name space change protocols
defined in the NVMe standard.

Sponsored by: Netflix


# a3a6c48d 02-Feb-2017 Warner Losh <imp@FreeBSD.org>

Ensure that the passthrough request will fit in MAXPHYS bytes after it
has been rounded to full pages. This avoids a panic in
vm_fault_quick_hold_pages due to this off-by-one error passing one
page too many into vmapbuf.
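
The off-by-one comes from unaligned buffers: a transfer of exactly MAXPHYS
bytes that does not start on a page boundary touches one extra page. A
sketch of the check, with assumed constants (4 KiB pages, 128 KiB MAXPHYS)
and a hypothetical function name:

```c
#include <stdbool.h>
#include <stdint.h>

#define	EX_PAGE_SIZE	4096u		/* assumed page size */
#define	EX_MAXPHYS	(128u * 1024u)	/* assumed MAXPHYS value */

/* True if [addr, addr + len), rounded out to whole pages, fits MAXPHYS. */
static bool
ex_passthru_fits(uint64_t addr, uint64_t len)
{
	uint64_t mask = (uint64_t)EX_PAGE_SIZE - 1;
	uint64_t start = addr & ~mask;			/* round down */
	uint64_t end = (addr + len + mask) & ~mask;	/* round up */

	return (end - start <= EX_MAXPHYS);
}
```

Checking only len against MAXPHYS would accept the unaligned case that the
page-rounded check rejects.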


# a965389b 07-Nov-2016 Scott Long <scottl@FreeBSD.org>

Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safety belts and error handling in related codepaths.

Reviewed by: jimharris
Obtained from: Netflix
Differential Revision: D8453


# f24c011b 10-Jun-2016 Warner Losh <imp@FreeBSD.org>

Commit the bits of nda that were missed. This should fix the build.

Approved by: re@


# 361e1fb4 23-Feb-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: fix intx handler to not dereference ioq during initialization

This was a regression from r293328, which deferred allocation
of the controller's ioq array until after interrupts are enabled
during boot.

PR: 207432
Reported and tested by: Andy Carrel <wac@google.com>
MFC after: 3 days
Sponsored by: Intel


# 43cd6160 18-Feb-2016 Justin Hibbits <jhibbits@FreeBSD.org>

Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any()

Since these calls only use default arguments, bus_alloc_resource_any() is the
right call.

Differential Revision: https://reviews.freebsd.org/D5306


# 7b036d77 11-Feb-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: avoid duplicate SET_NUM_QUEUES commands

nvme(4) issues a SET_NUM_QUEUES command during device
initialization to ensure enough I/O queues exist for each
of the MSI-X vectors we have allocated. The SET_NUM_QUEUES
command is then issued again during nvme_ctrlr_start(), to
ensure that is properly set after any controller reset.

At least one NVMe drive exists which fails this second
SET_NUM_QUEUES command during device initialization. So
change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES
command when it is coming out of a reset - avoiding the
duplicate SET_NUM_QUEUES during device initialization.

Reported by: gallatin
MFC after: 3 days
Sponsored by: Intel


# 9c6b5d40 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: replace NVME_CEILING macro with howmany()

Suggested by: rpokala
MFC after: 3 days


# 50dea2da 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: add hw.nvme.min_cpus_per_ioq tunable

Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.

This tunable is expressed in terms of minimum number of CPUs
per I/O queue instead of max number of queues per controller,
to allow for a more even distribution of CPUs per queue. This
avoids cases where some number of CPUs have a dedicated queue,
but other CPUs need to share queues. Ideally the PR referenced
above will eventually be fixed and the mechanism implemented
here becomes obsolete anyways.

While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.
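
A minimal sketch of how such a calculation could work, assuming one MSI-X
vector is reserved for the admin queue and the rest are shared evenly among
CPUs. The struct, function and parameter names are hypothetical, and the
driver's actual computation may differ in detail.

```c
#include <stdint.h>

#define	EX_HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

struct ex_ioq_layout {
	uint32_t cpus_per_ioq;
	uint32_t num_io_queues;
};

/* Assumes ncpus >= 1. */
static struct ex_ioq_layout
ex_plan_io_queues(uint32_t ncpus, uint32_t msix_vectors,
    uint32_t min_cpus_per_ioq)
{
	struct ex_ioq_layout l;
	uint32_t io_vectors, by_vectors;

	/* One vector is set aside for the admin queue. */
	io_vectors = msix_vectors > 1 ? msix_vectors - 1 : 1;
	/* CPUs per queue forced by the number of available vectors. */
	by_vectors = EX_HOWMANY(ncpus, io_vectors);
	if (min_cpus_per_ioq == 0)
		min_cpus_per_ioq = 1;
	/* The tunable can only raise the CPUs-per-queue figure. */
	l.cpus_per_ioq = by_vectors > min_cpus_per_ioq ?
	    by_vectors : min_cpus_per_ioq;
	l.num_io_queues = EX_HOWMANY(ncpus, l.cpus_per_ioq);
	return (l);
}
```

Expressing the limit as CPUs per queue means every queue serves roughly the
same number of CPUs, instead of a few CPUs getting dedicated queues while
the rest share.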

Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel


# 2b647da7 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: do not revert to a single I/O queue when per-CPU queues not possible

Previously nvme(4) would revert to a single I/O queue if it could not
allocate enough interrupt vectors or NVMe submission/completion queues
to have one I/O queue per core. This patch determines how to utilize a
smaller number of available interrupt vectors, and assigns (as closely
as possible) an equal number of cores to each associated I/O queue.

MFC after: 3 days
Sponsored by: Intel


# d400f790 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: break out interrupt setup code into a separate function

MFC after: 3 days
Sponsored by: Intel


# e5af5854 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: do not pre-allocate MSI-X IRQ resources

The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after: 3 days
Sponsored by: Intel


# c75ad8ce 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: remove per_cpu_io_queues from struct nvme_controller

Instead just use num_io_queues to make this determination.

This prepares for some future changes enabling use of multiple
queues when we do not have enough queues or MSI-X vectors
for one queue per CPU.

MFC after: 3 days
Sponsored by: Intel


# d85f84ab 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: simplify some of the nested ifs in interrupt setup code

This prepares for some follow-up commits which do more work in
this area.

MFC after: 3 days
Sponsored by: Intel


# fade8dd7 23-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.

In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division


# cbdec09c 23-Jul-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: ensure csts.rdy bit is cleared before returning from nvme_ctrlr_disable

PR: 200458
MFC after: 3 days
Sponsored by: Intel


# de9a58f4 23-Jul-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: properly handle case where pci_alloc_msix does not alloc all vectors

Reported by: Sean Kelly <smkelly@smkelly.org>
MFC after: 3 days
Sponsored by: Intel


# 36b0e4ee 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: remove CHATHAM related code

Chatham was an internal NVMe prototype board used for
early driver development.

MFC after: 1 week
Sponsored by: Intel


# e5ce5379 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: fall back to a smaller MSI-X vector allocation if necessary

Previously, if per-CPU MSI-X vectors could not be allocated,
nvme(4) would fall back to INTx with a single I/O queue pair.
This change will still fall back to a single I/O queue pair, but
allocate MSI-X vectors instead of reverting to INTx.

MFC after: 1 week
Sponsored by: Intel


# f42ca756 18-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by: Intel
MFC after: 3 days


# 496a2752 18-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Close hole where nvd(4) would not be notified of all nvme(4)
instances if modules loaded during boot.

Sponsored by: Intel
MFC after: 3 days


# 2b26030c 17-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Remove the software progress marker SET_FEATURE command during
controller initialization.

The spec says OS drivers should send this command after controller
initialization completes successfully, but other NVMe OS drivers are
not sending this command. This change will therefore reduce differences
between the FreeBSD and other OS drivers.

Sponsored by: Intel
MFC after: 3 days


# 448cffc8 06-Jan-2014 Jim Harris <jimharris@FreeBSD.org>

For IDENTIFY passthrough commands to Chatham prototype controllers, copy
the spoofed identify data into the user buffer rather than issuing the
command to the controller, since Chatham IDENTIFY data is always spoofed.

While here, fix a bug in the spoofed data for Chatham submission and
completion queue entry sizes.

Sponsored by: Intel
MFC after: 3 days


# d603c3d7 01-Nov-2013 Jim Harris <jimharris@FreeBSD.org>

Create a unique unit number for each controller and namespace cdev.

Sponsored by: Intel
MFC after: 3 days


# bb2f67fd 08-Oct-2013 Jim Harris <jimharris@FreeBSD.org>

Log and then disable asynchronous notification of persistent events after
they occur.

This prevents repeated notifications of the same event.

Status of these events may be viewed at any time by viewing the
SMART/Health Info Page using nvmecontrol, whether or not asynchronous
events notifications for those events are enabled. This log page can
be viewed using:

nvmecontrol logpage -p 2 <ctrlr id>

Future enhancements may re-enable these notifications on a periodic basis
so that if the notified condition persists, it will continue to be logged.

Sponsored by: Intel
Reviewed by: carl
Approved by: re (hrs)
MFC after: 1 week


# d5fc9821 08-Oct-2013 Jim Harris <jimharris@FreeBSD.org>

Do not enable temperature threshold as an asynchronous event notification
on NVMe controllers that do not support it.

Sponsored by: Intel
Reviewed by: carl
Approved by: re (hrs)
MFC after: 1 week


# 56183abc 13-Aug-2013 Jim Harris <jimharris@FreeBSD.org>

Send a shutdown notification in the driver unload path, to ensure
notification gets sent in cases where system shuts down with driver
unloaded.

Sponsored by: Intel
Reviewed by: carl
MFC after: 3 days


# 8e0ac13f 17-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Use pause() instead of DELAY() when polling for completion of admin
commands during controller initialization.

DELAY() does not work here during config_intrhook context - we need to
explicitly relinquish the CPU for the admin command completion to
get processed.

Sponsored by: Intel
Reported by: Adam Brooks <adam.j.brooks@intel.com>
Reviewed by: carl
MFC after: 3 days


# e9efbc13 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Update copyright dates.

MFC after: 3 days


# ec526ea9 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Do not retry failed async event requests.

Sponsored by: Intel
MFC after: 3 days


# 7b68ae1e 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Fail any passthrough command whose transfer size exceeds the controller's
max transfer size. This guards against rogue commands coming in from
userspace.

Also add KASSERTS for the virtual address and unmapped bio cases, if the
transfer size exceeds the controller's max transfer size.

Sponsored by: Intel
MFC after: 3 days


# 8d09e3c4 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Use MAXPHYS to specify the maximum I/O size for nvme(4).

Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed. The larger I/O size is very
beneficial for upcoming firmware download support. This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.

Sponsored by: Intel
MFC after: 3 days


# 5076698e 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Remove the NVME_IDENTIFY_CONTROLLER and NVME_IDENTIFY_NAMESPACE IOCTLs and replace
them with the NVMe passthrough equivalent.

Sponsored by: Intel


# 7c3f19d7 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Add support for passthrough NVMe commands.

This includes a new IOCTL to support a generic method for nvmecontrol(8) to pass
IDENTIFY, GET_LOG_PAGE, GET_FEATURES and other commands to the controller, rather than
separate IOCTLs for each.

Sponsored by: Intel


# a90b8104 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Rename the controller's fail_req_lock, so that it can be used for other
locking operations on the controller.

Sponsored by: Intel


# 1e526bc4 29-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or
NULL. This simplifies decisions around if/how requests are routed through
busdma. It also paves the way for supporting unmapped bios.

Sponsored by: Intel


# bb852ae8 28-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Delete extra IO qpairs allocated based on number of MSI-X vectors, but
later found to not be usable because the controller doesn't support the
same number of queues.

This is not the normal case, but does occur with the Chatham prototype
board.

Sponsored by: Intel


# 547d523e 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Clean up debug prints.

1) Consistently use device_printf.
2) Make dump_completion and dump_command into something more
   human-readable.

Sponsored by: Intel
Reviewed by: carl


# 237d2019 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Change a number of malloc(9) calls to use M_WAITOK instead of
M_NOWAIT.

Sponsored by: Intel
Suggested by: carl
Reviewed by: carl


# 955910a9 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Replace usages of mtx_pool_find used for admin commands with a polling
mechanism.

Now that all requests are timed, we are guaranteed to get a completion
notification, even if it is an abort status due to a timed out admin
command.

This has the effect of simplifying the controller and namespace setup
code, so that it reads straight through rather than broken up into
a bunch of different callback functions.

Sponsored by: Intel
Reviewed by: carl


# 232e2edb 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add the ability to internally mark a controller as failed, if it is unable to
start or reset. Also add a notifier for NVMe consumers for controller fail
conditions and plumb this notifier for nvd(4) to destroy the associated
GEOM disks when a failure occurs.

This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state. To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.

Sponsored by: Intel
Reviewed by: carl


# 3d7eb41c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Just disable the controller instead of deleting IO queues during detach.

This is just as effective, and removes the need for a bunch of admin commands
to a controller that's going to be disabled shortly anyway.

Sponsored by: Intel
Reviewed by: carl


# 74019d4b 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Set Pre-boot Software Load Count to 0 at the end of the controller
start process.

The spec indicates the OS driver should use Set Features (Software
Progress Marker) to set the pre-boot software load count to 0
after the OS driver has successfully been initialized. This allows
pre-boot software to determine if there have been any issues with the
OS loading.

Sponsored by: Intel
Reviewed by: carl


# be34f216 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Remove the is_started flag from struct nvme_controller.

This flag was originally added to communicate to the sysctl code
which oids should be built, but there are easier ways to do this. This
needs to be cleaned up prior to adding new controller states - for example,
controller failure.

Sponsored by: Intel
Reviewed by: carl


# 02e33484 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Ensure the controller's MDTS is accounted for in max_xfer_size.

The controller's IDENTIFY data contains MDTS (Max Data Transfer Size) to
allow the controller to specify the maximum I/O data transfer size. nvme(4)
already provides a default maximum, but make sure it does not exceed what
MDTS reports.
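
As a rough illustration of the clamp described above: in the NVMe
specification, MDTS is a power-of-two exponent in units of the
controller's minimum memory page size, and a value of 0 means the
controller reports no limit. The sketch below assumes that encoding; the
function name and parameters are hypothetical and are not taken from the
driver.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the driver's code): clamp a driver default
 * maximum transfer size to the limit reported through MDTS.
 *
 * mdts == 0 means the controller reports no limit. Otherwise the limit
 * is (min_page_size << mdts) bytes.
 */
static uint32_t
clamp_max_xfer_size(uint32_t default_max, uint8_t mdts, uint32_t min_page_size)
{
	uint64_t limit;

	if (mdts == 0)
		return (default_max);

	/* An exponent this large cannot be below a 32-bit default. */
	if (mdts >= 32)
		return (default_max);

	limit = (uint64_t)min_page_size << mdts;
	return (limit < default_max ? (uint32_t)limit : default_max);
}
```

With a 4096-byte minimum page size and a 128 KiB default, an MDTS of 3
would lower the limit to 32 KiB, while an MDTS of 8 would leave the
default untouched.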

Sponsored by: Intel
Reviewed by: carl


# cb5b7c13 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Cap the number of retry attempts to a configurable number. This ensures
that if a specific I/O repeatedly times out, we don't retry it indefinitely.

The default number of retries is 4, but it can be adjusted using hw.nvme.retry_count.
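
A minimal sketch of a capped retry decision, assuming each request
carries its own retry counter. The struct and function names are
hypothetical; only the idea of comparing against a configurable limit
comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-request state; the real request has more fields. */
struct request_sketch {
	uint32_t retries;	/* retries already performed */
};

/*
 * Return true and count the attempt if the request may be retried,
 * false once the configured limit has been reached.
 */
static bool
request_should_retry(struct request_sketch *req, uint32_t retry_count)
{
	if (req->retries >= retry_count)
		return (false);
	req->retries++;
	return (true);
}
```

With a limit of 4, a request that keeps timing out is retried four times
and then failed back to the caller.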

Sponsored by: Intel
Reviewed by: carl


# 0d7e13ec 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Pass associated log page data to async event consumers, if requested.

Sponsored by: Intel
Reviewed by: carl


# 2868353a 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

When an asynchronous event request is completed, automatically fetch the
specified log page.

This satisfies the spec condition that future async events of the same type
will not be sent until the associated log page is fetched.

Sponsored by: Intel
Reviewed by: carl


# cf81529c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Create struct nvme_status.

NVMe error log entries include status, so breaking this out into
its own data structure allows it to be included in both the
nvme_completion data structure as well as error log entry data
structures.

While here, expose nvme_completion_is_error(), and change all of
the places that were explicitly looking at sc/sct bits to use this
macro instead.

Sponsored by: Intel
Reviewed by: carl


# f37c22a3 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Make nvme_ctrlr_reset a nop if a reset is already in progress.

This protects against cases where a controller crashes with multiple
I/Os outstanding, each timing out and requesting controller resets
simultaneously.
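
One way to make a reset request a no-op while another reset is in flight
is an atomic compare-and-swap on an "is resetting" flag. The sketch below
uses C11 atomics as a stand-in for the kernel's atomic primitives; all
names in it are hypothetical and it is not the driver's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical controller state for illustration only. */
struct ctrlr_sketch {
	atomic_int is_resetting;	/* 0 = idle, 1 = reset in progress */
	int resets_started;		/* how many resets actually began */
};

static void
ctrlr_init_sketch(struct ctrlr_sketch *c)
{
	atomic_init(&c->is_resetting, 0);
	c->resets_started = 0;
}

/*
 * Start a reset unless one is already running. Only the caller that
 * flips the flag from 0 to 1 proceeds; everyone else returns false.
 */
static bool
ctrlr_reset_sketch(struct ctrlr_sketch *c)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&c->is_resetting, &expected, 1))
		return (false);
	c->resets_started++;
	return (true);
}

/* Called when the reset has finished, allowing a later reset. */
static void
ctrlr_reset_done_sketch(struct ctrlr_sketch *c)
{
	atomic_store(&c->is_resetting, 0);
}
```

If several timed-out I/Os call ctrlr_reset_sketch() at once, exactly one
of them starts the reset and the rest return immediately.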

While here, remove a debugging printf from a previous commit, and add
more logging around I/Os that need to be resubmitted after a controller
reset.

Sponsored by: Intel
Reviewed by: carl


# 48ce3178 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

By default, always escalate to controller reset when an I/O times out.

While aborts are typically cleaner than a full controller reset, many times
an I/O timeout indicates other controller-level issues where aborts may not
work. NVMe drivers for other operating systems are also defaulting to
controller reset rather than aborts for timed out I/O.

Sponsored by: Intel
Reviewed by: carl


# 94143332 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add a tunable for the I/O timeout interval. Default is still 30 seconds,
but can be adjusted between a min/max of 5 and 120 seconds.
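
A small sketch of the range check implied above. The commit message does
not say whether out-of-range values are rejected or clamped; this sketch
assumes rejection with EINVAL, and the macro and function names are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define TIMEOUT_MIN_SEC		5	/* lower bound from the commit message */
#define TIMEOUT_MAX_SEC		120	/* upper bound from the commit message */
#define TIMEOUT_DEFAULT_SEC	30	/* default from the commit message */

/*
 * Store a new timeout period if it is within bounds. On failure the
 * current value is left unchanged and EINVAL is returned (assumed
 * behavior, see the note above).
 */
static int
set_timeout_period(uint32_t *period, uint32_t requested)
{
	if (requested < TIMEOUT_MIN_SEC || requested > TIMEOUT_MAX_SEC)
		return (EINVAL);
	*period = requested;
	return (0);
}
```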

Sponsored by: Intel
Reviewed by: carl


# 12d191ec 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add handling for controller fatal status (csts.cfs).

On any I/O timeout, check for csts.cfs==1. If set, the controller
is reporting fatal status and we reset the controller immediately,
rather than trying to abort the timed out command.
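
The decision described above can be sketched as follows. In the NVMe
controller status register (CSTS), Controller Fatal Status is bit 1; the
enum and function names here are hypothetical, and reading the register
from hardware is left out.

```c
#include <stdint.h>

#define CSTS_CFS	(1u << 1)	/* Controller Fatal Status bit in CSTS */

/* Hypothetical result type for illustration. */
enum timeout_action {
	TIMEOUT_ACTION_ABORT,	/* try to abort the timed out command */
	TIMEOUT_ACTION_RESET	/* reset the whole controller */
};

/* Pick the recovery action for a timed out I/O from a CSTS snapshot. */
static enum timeout_action
action_on_timeout(uint32_t csts)
{
	if ((csts & CSTS_CFS) != 0)
		return (TIMEOUT_ACTION_RESET);
	return (TIMEOUT_ACTION_ABORT);
}
```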

This changeset also includes deferring the controller start portion
of the reset to a separate task. This ensures we are always performing
a controller start operation from a consistent context.

Sponsored by: Intel
Reviewed by: carl


# dbba7442 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add API for nvme consumers to access controller and namespace identify data.

Sponsored by: Intel
Reviewed by: carl


# b846efd7 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add controller reset capability to nvme(4) and ability to explicitly
invoke it from nvmecontrol(8).

Controller reset will be performed in cases where I/Os are repeatedly
timing out, the controller reports an unrecoverable condition, or
when explicitly requested via IOCTL or an nvme consumer. Since the
controller may be in such a state where it cannot even process queue
deletion requests, we will perform a controller reset without trying
to clean up anything on the controller first.

Sponsored by: Intel
Reviewed by: carl


# 038a5ee4 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add an interface for nvme shim drivers (i.e. nvd) to register for
notifications when new nvme controllers are added to the system.

Sponsored by: Intel


# 0a0b08cc 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Enable asynchronous event requests on non-Chatham devices.

Also add logic to clean up all outstanding asynchronous event requests
when resetting or shutting down the controller, since these requests
will not be explicitly completed by the controller itself.

Sponsored by: Intel


# 990e741c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Move controller destruction code from nvme_detach() to new nvme_ctrlr_destruct()
function.

Sponsored by: Intel


# 4b52061e 07-Mar-2013 David E. O'Brien <obrien@FreeBSD.org>

Fix GCC build:
/usr/src/sys/modules/nvme/../../dev/nvme/nvme.c:211: warning: format '%qx' expects type 'long unsigned int', but argument 9 has type 'long long unsigned int' [-Wformat]


# 91fe20e3 18-Dec-2012 Jim Harris <jimharris@FreeBSD.org>

Map BAR 4/5, because NVMe spec says devices may place the MSI-X table
behind BAR 4/5, rather than in BAR 0/1 with the control/doorbell registers.

Sponsored by: Intel


# 4d6abcb1 18-Dec-2012 Jim Harris <jimharris@FreeBSD.org>

Do not use taskqueue to defer completion work when using INTx. INTx now
matches MSI-X behavior.

Sponsored by: Intel


# 21b6da58 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Preallocate a limited number of nvme_tracker objects per qpair, rather
than dynamically creating them at runtime.

Sponsored by: Intel


# 5ae9ed68 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Create nvme_qpair_submit_request() which eliminates all of the code
duplication between the admin and io controller-level submit
functions.

Sponsored by: Intel


# c2e83b40 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Simplify how the qpair lock is acquired and released.

Sponsored by: Intel


# 5fa5cc5f 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Cleanup uio-related code to use struct nvme_request and
nvme_ctrlr_submit_io_request().

While here, also fix case where a uio may have more than 1 iovec.
NVMe's definition of SGEs (called PRPs) only allows for the first SGE to
start on a non-page boundary. The simplest way to handle this is to
construct a temporary uio for each iovec, and submit an NVMe request
for each.

Sponsored by: Intel


# d281e8fb 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add nvme_ctrlr_submit_[admin|io]_request functions which consolidate
code for allocating nvme_tracker objects and making calls into
bus_dmamap_load for commands which have payloads.

Sponsored by: Intel


# 8a382371 18-Sep-2012 Jim Harris <jimharris@FreeBSD.org>

Add #if 0 around nvme_async_event_cb() until NVMe AER functionality
can be tested.

This fixes a build warning found only with clang.


# bb0ec6b3 17-Sep-2012 Jim Harris <jimharris@FreeBSD.org>

This is the first of several commits which will add NVM Express (NVMe)
support to FreeBSD. A full description of the overall functionality
being added is below. nvmexpress.org defines NVM Express as "an optimized
register interface, command set and feature set for PCI Express (PCIe)-based
Solid-State Drives (SSDs)."

This commit adds nvme(4) and nvd(4) driver source code and Makefiles
to the tree.

Full NVMe functionality description:
Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe)
device support.

There will continue to be ongoing work on NVM Express support, but there
is more than enough to allow for evaluation of pre-production NVM Express
devices as well as soliciting feedback. Questions and feedback are welcome.

nvme(4) implements NVMe hardware abstraction and is a provider of NVMe
namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN.
nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks.
nvmecontrol(8) is used for NVMe configuration and management.

The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue
  dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts, see below)
- compilation switches for support back to stable-7

nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling

nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in
  human-readable or hex format
- perftest: quick and dirty performance test to measure raw
  performance of NVMe device without userspace/physio/GEOM
  overhead

The following are still work in progress and will be completed over the
next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval

This has been primarily tested on amd64, with light testing on i386. I
would be happy to provide assistance to anyone interested in porting
this to other architectures, but am not currently planning to do this
work myself. Big-endian and dmamap sync for command/completion queues
are the main areas that would need to be addressed.

The nvme(4) driver currently has references to Chatham, which is an
Intel-developed prototype board which is not fully spec compliant.
These references will all be removed over time.

Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>