History log of /freebsd-current/sys/dev/nvme/nvme_qpair.c
Revision Date Author Comments
# 0dd84c3b 13-May-2024 Warner Losh <imp@FreeBSD.org>

nvme: Add comment about where tr->deadline is set

It's easy to overlook the chain of events that lead to tr->deadline
being updated. Add a comment here to explain what otherwise looks like
an oversight w/o careful study.

Sponsored by: Netflix


# c931cf6a 13-May-2024 Warner Losh <imp@FreeBSD.org>

nvme: Slight simplification

We don't need to dereference qpair to get the ctrlr pointer each time,
so use the cached value. It's not going to change. No change intended.

Sponsored by: Netflix


# 9db8ca92 13-May-2024 Warner Losh <imp@FreeBSD.org>

nvme: Slight reworking this loop to match FreeBSD style

Update the comment for the code, and slightly rework the code in the
'fast exit' paradigm that FreeBSD generally tries to do.

Sponsored by: Netflix


# 5a178b83 13-May-2024 Warner Losh <imp@FreeBSD.org>

nvme: Add locking asserts

nvme_qpair_complete_tracker and nvme_qpair_manual_complete_tracker have
to be called without the qpair lock, so assert its unowned.

Sponsored by: Netflix


# 5650bd3f 29-Jan-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Use the NVMEF macro to construct fields

Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43605


# 479680f2 29-Jan-2024 John Baldwin <jhb@FreeBSD.org>

nvme: Use the NVMEV macro instead of expanded versions

Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43595


# fdafd315 24-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Automated cleanup of cdefs and other formatting

Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.

Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/

Sponsored by: Netflix


# 8d6c0743 06-Nov-2023 Alexander Motin <mav@FreeBSD.org>

nvme: Introduce longer timeouts for admin queue

KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty
namespace. In some cases like hot-plug it takes longer, triggering
timeout and controller resets after just 30 seconds. Linux for many
years has separate 60 seconds timeout for admin queue. This patch
does the same. And it is good to be consistent.

Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454


# afc3d49b 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: Close a race in destroying qpair and timeouts

While we should have cleared all the pending I/O prior to calling
nvme_qpair_destroy, which should ensure that if the callout_drain causes
a call to nvme_qpair_timeout(), it won't schedule any new
timeout. However, it doesn't hurt to set timeout_pending to false in
nvme_qpair_destroy() and have nvme_qpair_timeout() exit early if it sees
it w/o scheduling a timeout. Since we don't otherwise stop the timeout
until we're about to destroy the qpair, this ensures we fail safe. The
lock/unlock also ensures the callout_drain will either remove the callout,
or wait for it to run with the early bailout.

We can likely further improve this by using callout_stop() inside the
pending lock. I'll investigate that for future refinement.

Sponsored by: Netflix
Suggestions by: jhb
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D42065


# 9cd7b624 10-Oct-2023 Warner Losh <imp@FreeBSD.org>

nvme: Eliminate RECOVERY_FAILED state

While it seemed like a good idea to have this state, we can do
everything we wanted with the state by checking ctrlr->is_failed since
that's set before we start failing the qpairs. Add some comments about
racing when we're failing the controller, though in practice I'm not
sure that kind of race could even be lost.

Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42051


# 1d6021cd 25-Sep-2023 Warner Losh <imp@FreeBSD.org>

nvme: Supress noise messages

When we're suspending, we get messages about waiting for the controller
to reset. These are in error: we're not waiting for it to reset. We put
the recovery state as part of suspending, so we should suppress these as
a false positive.

Also remove a stray debug that's left over from earlier versions of
the recovery code that no longer makes sense.

Sponsored by: Netflix


# da8324a9 24-Sep-2023 Warner Losh <imp@FreeBSD.org>

nvme: Fix locking protocol violation to fix suspend / resume

Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.

Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.

Also, tidy up the failure case as well: failing a queue is a superset of
disabling it, so no need to call disable first. This solves solves some
locking issues with recursion since we don't need to recurse.. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.

Next, kill the pre/post reset stuff. It's completely bogus since we
disable the qparis, we don't need to also hold the lock through the
reset: disabling will cause the ISR to return early. This keeps us from
recursing on the recovery lock when resuming. We only need the recovery
lock to avoid a specific race between the timer and the ISR.

Finally, kill NVME_RESET_2X. It'S been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.

These are all interrelated, so need to be done at the same time.

Sponsored by: Netflix
Reviewed by: jhb
Tested by: jhb (made sure suspend / resume worked)
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D41866


# d9543162 15-Sep-2023 Warner Losh <imp@FreeBSD.org>

nvme: Give up when we've failed

Normally, we poll the device every so often to see if commands have
timed out. However, we'll go into the recovery state as part of failing
the drive. To account for all possibilties, if we're failed when we get
into the polling function, just stop polling: Party is over.

Sponsored by: Netflix


# 8052b01e 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add exclusion for ISR

Add a basically uncontended spinlock that we take out while the ISR is
running. This has two effects: First, when we get a timeout, we can
safely call the nvme_qpair_process_completions w/o racing any ISRs.
Second, we can use it to ensure that we don't reset the card while
the ISRs are active (right now we just sleep and hope for the best,
which usually is fine, but not always).

Sponsored by: Netflix
MFC After: 2 weeks
Reviewed by: chuck, gallatin
Differential Revision: https://reviews.freebsd.org/D41452


# d4959bfc 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Greatly improve error recovery

Next phase of error recovery: Eliminate the REOVERY_START phase, since
we don't need to wait to start recovery. Eliminate the RECOVERY_RESET
phase since it is transient, we now transition from RECOVERY_NORMAL into
RECOVERY_WAITING.

In normal mode, read the status of the controller. If it is in failed
state, or appears to be hot-plugged, jump directly to reset which will
sort out the proper things to do. This will cause all pending I/O to
complete with an abort status before the reset.

When in the NORMAL state, call the interrupt handler. This will complete
all pending transactions when interrupts are broken or temporarily
misbehaving. We then check all the pending completions for timeouts. If
we have abort enabled, then we'll send an abort. Otherwise we'll assume
the controller is wedged and needs a reset. By calling the interrupt
handler here, we'll avoid an issue with the current code where we
transitioned to RECOVERY_START which prevented any completions from
happening. Now completions happen. In addition and follow-on I/O that is
scheduled in the completion routines will be submitted, rather than
queued, because the recovery state is correct. This also fixes a problem
where I/O would timeout, but never complete, leading to hung I/O.

Resetting remains the same as before, just when we chose to reset has
changed.

A nice side effect of these changes is that we now do I/O when
interrupts to the card are totally broken. Followon commits will improve
the error reporting and logging when this happens. Performance will be
aweful, but will at least be minimally functional.

There is a small race when we're checking the completions if interrupts
are working, but this is handled in a future commit.

Sponsored by: Netflix
MFC After: 2 weeks
Differential Revision: https://reviews.freebsd.org/D36922


# 2a6b7055 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Timeout expired transactions

When we went to having a shared timeout routine, failing the timed-out
transaction code was inadvertantly dropped. Reinstate it.

Fixes: 502dc84a8b670
Sponsored by: Netflix
MFC After: 2 weeks
Reviewed by: chuck, jhb
Differential Revision: https://reviews.freebsd.org/D36921


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 2ad9a815 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Directly lookup op code

Rather than have a table to walk through, use a sparse array.

Suggested by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D41353


# 95cd10f1 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add comments about other fields in status

When manually completing an I/O, we do so because we have no status back
from the card. Note M, CRD and P are all 0 because this is an artificial
event (and phase isn't checked when it's completed this way). There's no
MORE information in the error log page and there's no delayed retry
(CRD=0) and we don't currently request CRD to be set to anything other
than 0 and thus don't implement delayed retry.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41314


# a510dbc8 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Be less verbose when cancelling I/O or admin commands

When we're resetting, and there's outstanding I/O that we're cancelling,
only report we're cancelling the I/O once rather than once per
I/O. Likewise when we reschedule the I/O. We don't need to say for each
one that we're cancelling/rescheduling something, and then report the
I/O that we're doing. Likewise with cancelling admin commands (we never
retry them here, so a similar change isn't needed).

Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D41313


# ac8c866f 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add more NVME Base Spec 2.0 and NVME Command Set Spec 1.0a

Add admin commands capacity management, lockdown and fabrics commands.
Add I/O copy command.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41311


# edd23e4d 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Eliminate redundant code

get_admin_opcode_string and get_io_opcode_string are identical, but
start with different tables. Use a helper routine that takes an argument
to implement these instead. A future commit will refine this further.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41310


# 7be0b068 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate command printing routine

Both nvme_dump_command and nvme_qpair_print_command print nvme
commands. The former latter better. Recode the one call to
nvme_dump_command to use nvme_qpair_print_command and delete the
former. No sense having two nearly identical routines. A future commit
will convert to sbuf.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41309


# 6f76d493 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate completion printing routine

Both nvme_dump_completion and nvme_qpair_print_completion print
completions. The latter is better. Recode the two instances of
nvme_dump_completion to use nvme_qpair_print_completion and delete the
former. No sense having two nearly identical routines. A future commit
will convert this to sbuf.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D41308


# 92103adb 24-Jul-2023 John Baldwin <jhb@FreeBSD.org>

nvme: Use a memdesc for the request buffer instead of a bespoke union.

This avoids encoding CAM-specific knowledge in nvme_qpair.c.

Reviewed by: chuck, imp, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41119


# 5ae44634 27-Jun-2023 John Baldwin <jhb@FreeBSD.org>

nvme: Fix typo in "Command Aborted by Host" constant name.

Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D40763


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 49ebbdb2 08-Mar-2023 Alexander Motin <mav@FreeBSD.org>

Add NAMESPACE MANAGEMENT into admin_opcode[].

MFC after: 1 week


# 4982884b 11-Oct-2022 Warner Losh <imp@FreeBSD.org>

nvme: Always set deadline to max

When a transaction is on the outstanding list, it needs to have a valid
timeout value, so set it to infinity before placing it on the
list. Place before we put it on the list, even though the list is
protected by the qpair lock.

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D36920


# a69c0964 05-Aug-2022 Alexander Motin <mav@FreeBSD.org>

nvme: Print CRD, M and DNR status bits on errors.

It may help with some issues debugging.

MFC after: 1 week


# 0fd4cd40 15-Apr-2022 Warner Losh <imp@FreeBSD.org>

nvme: Use controller's page size instead of PAGE_SIZE to create qpair

When constructing qpair, use the controller's notion of page size rather
than the host's PAGE_SIZE. Currently, these are both 4k, but the arm 16k
page size support requires decoupling.

There's a "hidden" PAGE_SIZE in btoc, so we must change btoc(x) to
howmany(x, ctrlr->page_size) to properly count the number of pages (in
the drive's world view) are needed for various calculations.

With these changes, we the nvme driver operates at production level load
for both host 4k and host 16k page size.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34873


# dfa01f4f 08-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

nvme(4): Fix a typo in a source code comment

- s/is is/is/

MFC after: 3 days


# b3c9b606 06-Jan-2022 Alexander Motin <mav@FreeBSD.org>

nvme: Do not rearm timeout for commands without one.

Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding.
They have no timeouts, but their presence in qpair->outstanding_tr caused
useless timeout callout rearming twice a second.

While there, relax timeout callout period from 0.5s to 0.5-1s to improve
aggregation. Command timeouts are measured in seconds, so we don't need
to be precise here.

Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D33781


# 2ec165e3 14-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Reduce traffic to the doorbell register

Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).

Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32470


# 18dc12bf 12-Oct-2021 Warner Losh <imp@FreeBSD.org>

nvme: Restore hotplug warning

Restore hotplug warning in recovery state machine. No functional change
other than what message gets printed.

Sponsored by: Netflix


# 36a87d0c 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Sanity check completion id

Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.

Sponsored by: Netflix
Reviewed by: mav, chs, chuck
Differential Revision: https://reviews.freebsd.org/D32088


# 587aa255 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: count number of ignored interrupts

Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.

Sponsored by: Netflix
Reviewed by: mav, chuck
Differential Revision: https://reviews.freebsd.org/D32212


# 7d5eebe0 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Add sanity check for phase on startup.

The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.

Sponsored by: Netflix
Reviewed by: cperciva, mav (prior version)
Differential Revision: https://reviews.freebsd.org/D32211


# fa81f373 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: start qpair in state RECOVERY_WAITING

An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or to process it if we are. If we are initialized, we'll
see there's no completion records and return. If we're not, we'll
process no completion records and return. Either way, nothing is
processed and nothing is lost.

Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.

This fixes an KASSERT "received completion for unknown cmd" panic on
boot.

Fixes: 502dc84a8b6703e7c0626739179a3cdffdd22d81
Sponsored by: Netflix
Reviewed by: mav, cperciva, gallatin
Differential Revision: https://reviews.freebsd.org/D32210


# 502dc84a 23-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Use shared timeout rather than timeout per transaction

Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28583


# 4b977e6d 17-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme/nda: Fail all nvme I/Os after controller fails

Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.

Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31341


# e3bdf3da 31-Aug-2021 Alexander Motin <mav@FreeBSD.org>

nvme(4): Add MSI and single MSI-X support.

If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared. If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios. For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after: 1 week


# fc9a0840 15-Jul-2021 Warner Losh <imp@FreeBSD.org>

nvme: Enable interrupts after qpair fully constructed

To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.

The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.

Reviewed by: mav, chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31182


# aa0ab681 02-Jul-2021 Warner Losh <imp@FreeBSD.org>

nvme: coherently read status of completion records

Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.

To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.

Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002


# 9600aa31 08-Feb-2021 Warner Losh <imp@FreeBSD.org>

nvme: use NVME_GONE rather than hard-coded 0xffffffff

Make it clearer that the value 0xfffffff is being used to detect the device is
gone. We use it other places in the driver for other meanings.


# 082905ca 04-Dec-2020 Warner Losh <imp@FreeBSD.org>

nvme: Remove a wmb() that's not necessary.

bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.

For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.

For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestiage of the pre-busdma
version initially committed to the tree.

Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448


# 8f9d5a8d 02-Dec-2020 Michal Meloun <mmel@FreeBSD.org>

NVME: Multiple busdma related fixes.
- in nvme_qpair_process_completions() do dma sync before completion buffer
is used.
- in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm
and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure
that all CPU stores are visible to external (including DMA) observers.
- Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems,
buffers continuously owned (and accessed) by DMA must be allocated with this
flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems
(or coherent buses in mixed systems).

MFC after: 4 weeks
Reviewed by: mav, imp
Differential Revision: https://reviews.freebsd.org/D27446


# 8d08cdc7 02-Dec-2020 Chuck Tuffli <chuck@FreeBSD.org>

nvme: Fix typo in definition

Change occurrences of "selt test" to "self tests in the NVMe header
file.

Reviewed by: imp, mav
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27439


# ac90f70d 28-Nov-2020 Alexander Motin <mav@FreeBSD.org>

Increase nvme(4) maximum transfer size from 1MB to 2MB.

With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs. The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# d87b31e1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

nvme: clean up empty lines in .c and .h files


# 96ad26ee 04-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove free_domain() and uma_zfree_domain().

These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.

Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937


# ead7e103 18-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Make polled request timeout less invasive.

Instead of panic after one second of polling, make the normal timeout
handler to activate, reset the controller and abort the outstanding
requests. If all of it won't happen within 10 seconds then something
in the driver is likely stuck bad and panic is the only way out.

In particular this fixed device hot unplug during execution of those
polled commands, allowing clean device detach instead of panic.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 550d5d64 17-Jun-2020 Alexander Motin <mav@FreeBSD.org>

Fix admin qpair leak if detached during initial reset.

MFC after: 1 week
Sponsored by: iXsystems, Inc.


# 4053f8ac 02-May-2020 David Bright <dab@FreeBSD.org>

Fix various Coverity-detected errors in nvme driver

This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by: imp@, vangyzen@
MFC after: 5 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24532


# aeb665b5 30-Mar-2020 Ed Maste <emaste@FreeBSD.org>

remove extraneous double ;s in sys/


# 0a4b14e8 15-Dec-2019 Michal Meloun <mmel@FreeBSD.org>

Properly synchronize completion DMA buffers.
Within command completion processing the callback function may access
DMAed data buffer. Synchronize it before use, not after.
This allows to use NVMe disk on non-DMA coherent arm64 system.

MFC after: 3 weeks


# 7588c6cc 13-Dec-2019 Warner Losh <imp@FreeBSD.org>

Move to using bool instead of boolean_t

While there are subtle semantic differences between bool and boolean_t, none of
them matter in these cases. Prefer true/false when dealing with bool
type. Preserve a couple of TRUEs since they are passed into int args into CAM.
Preserve a couple of FALSEs when used for status.done, an int.

Differential Revision: https://reviews.freebsd.org/D20999


# 43393e8b 06-Dec-2019 Warner Losh <imp@FreeBSD.org>

trackers always know what qpair they are on

Don't needlessly pass around qpair pointers when the tracker knows what
qpair it's on. This will simplify code and make it easier to split
submission and completion queues in the future.

Signed-off-by: John Meneghini <johnm@netapp.com>


# 1eab19cb 23-Sep-2019 Alexander Motin <mav@FreeBSD.org>

Make nvme(4) driver some more NUMA aware.

- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after: 1 month
Sponsored by: iXsystems, Inc.


# f93b7f95 04-Sep-2019 Warner Losh <imp@FreeBSD.org>

Support doorbell strides != 0.

The NVMe standard (1.4) states

>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.

However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.

MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514


# 71a28181 21-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Improve NVMe hot unplug handling.

If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# a6d222eb 02-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Add more random bits from NVMe 1.4.

MFC after: 2 weeks


# 90dfa8f0 01-Aug-2019 Alexander Motin <mav@FreeBSD.org>

Add more new fields and values from NVMe 1.4.

MFC after: 2 weeks


# 5e83c2ff 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of commands that exhaust their retry limit.

While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.


# c37fc318 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of retried commands.

Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.


# c75bdc04 18-Jul-2019 Warner Losh <imp@FreeBSD.org>

Provide new tunable hw.nvme.verbose_cmd_dump

The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.

Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988


# d0aaeffd 01-Jun-2019 Warner Losh <imp@FreeBSD.org>

Since a fatal trap can happen at aribtrary times, don't panic when the
completions are not in a consistent state. Cope with the different
places the normal I/O completion polling thread can be interrupted and
then re-entered during a kernel panic + dump.

Reviewed by: jhb and markj (both prior versions)
Differential Revision: https://reviews.freebsd.org/D20478


# 2ffd6fce 08-Mar-2019 Warner Losh <imp@FreeBSD.org>

Don't print all the I/O we abort on a reset, unless we're out of
retries.

When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.

Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431


# 95108cad 02-Mar-2019 Warner Losh <imp@FreeBSD.org>

Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this)


# 45d7e233 27-Feb-2019 Warner Losh <imp@FreeBSD.org>

Unconditionally support unmapped BIOs. This was another shim for
supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.


# d706306d 27-Feb-2019 Warner Losh <imp@FreeBSD.org>

Remove #ifdef code to support FreeBSD versions that haven't been
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.


# a6461357 26-Dec-2018 Alexander Motin <mav@FreeBSD.org>

Add descriptions to NVMe interrupts.

MFC after: 1 month


# 9544e6dc 21-Aug-2018 Chuck Tuffli <chuck@FreeBSD.org>

Make NVMe compatible with the original API

The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).

This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.

This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.

Bump __FreeBSD_version to 1200081 due to API change

Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404


# 2e0090af 03-Aug-2018 Justin Hibbits <jhibbits@FreeBSD.org>

nvme(4): Add bus_dmamap_sync() at the end of the request path

Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often. With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix. Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.

Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570


# c6c70c07 30-Apr-2018 Alexander Motin <mav@FreeBSD.org>

Fix use-after-free in nvme_qpair_destroy().

dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.

Sponsored by: iXsystems, Inc.


# d85d9648 15-Mar-2018 Warner Losh <imp@FreeBSD.org>

Try polling the qpairs on timeout.

On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.

This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.

PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix


# 6b1a96b1 10-Mar-2018 Alexander Motin <mav@FreeBSD.org>

Add new opcodes and statuses from NVMe 1.3a.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 0d787e9b 22-Feb-2018 Wojciech Macek <wma@FreeBSD.org>

NVMe: Add big-endian support

Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916


# 718cf2cc 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/dev: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 51977281 29-Aug-2017 Warner Losh <imp@FreeBSD.org>

Add CAM/NVMe support for CAM_DATA_SG

This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@


# 824073fb 07-Mar-2017 Warner Losh <imp@FreeBSD.org>

Avoid dereferencing unintialized elements in the error path.

Some drives sometimes have errors for things like setting the number
of queue entries in the submission queue. The error paths taken for
these drives ensure a panic dereferencing uninialized data.

Sponsored by: Netflix


# a965389b 07-Nov-2016 Scott Long <scottl@FreeBSD.org>

Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a
bunch of safery belts and error handling in related codepaths.

Reviewed by: jimharris
Obtained from: Netflix
Differential Revision: D8453


# e5af5854 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: do not pre-allocate MSI-X IRQ resources

The issue referenced here was resolved by other changes
in recent commits, so this code is no longer needed.

MFC after: 3 days
Sponsored by: Intel


# 3345ed9a 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: use BUS_SPACE_MAXSIZE for bus_dma_tag_create maxsize parameter

This fixes i386 PAE build fallout from r281281.

Reported by: bz
MFC after: 1 week


# 36b0e4ee 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: remove CHATHAM related code

Chatham was an internal NVMe prototype board used for
early driver development.

MFC after: 1 week
Sponsored by: Intel


# a6e30963 08-Apr-2015 Jim Harris <jimharris@FreeBSD.org>

nvme: create separate DMA tag for non-payload DMA buffers

Submission and completion queue memory need to use a
separate DMA tag for mappings than payload buffers,
to ensure mappings remain contiguous even with DMAR
enabled.

Submitted by: kib
MFC after: 1 week
Sponsored by: Intel


# f42ca756 18-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: Allocate all MSI resources up front so that we can fall back to
INTx if necessary.

Sponsored by: Intel
MFC after: 3 days


# 1416ef36 17-Mar-2014 Jim Harris <jimharris@FreeBSD.org>

nvme: NVMe specification dictates 4-byte alignment for PRPs (not 8).

Sponsored by: Intel
MFC after: 3 days


# e9efbc13 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Update copyright dates.

MFC after: 3 days


# bbd412dd 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Remove remaining uio-related code.

The nvme_physio() function was removed quite a while ago, which was the
only user of this uio-related code.

Sponsored by: Intel
MFC after: 3 days


# 7b68ae1e 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Fail any passthrough command whose transfer size exceeds the controller's
max transfer size. This guards against rogue commands coming in from
userspace.

Also add KASSERTS for the virtual address and unmapped bio cases, if the
transfer size exceeds the controller's max transfer size.

Sponsored by: Intel
MFC after: 3 days


# 8d09e3c4 26-Jun-2013 Jim Harris <jimharris@FreeBSD.org>

Use MAXPHYS to specify the maximum I/O size for nvme(4).

Also allow admin commands to transfer up to this maximum I/O size, rather
than the artificial limit previously imposed. The larger I/O size is very
beneficial for upcoming firmware download support. This has the added
benefit of simplifying the code since both admin and I/O commands now use
the same maximum I/O size.

Sponsored by: Intel
MFC after: 3 days


# ca269f32 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Move the busdma mapping functions to nvme_qpair.c.

This removes nvme_uio.c completely.

Sponsored by: Intel


# e2b99004 12-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Do not panic when a busdma mapping operation fails.

Instead, print an error message and fail the associated command with
DATA_TRANSFER_ERROR NVMe completion status.

Sponsored by: Intel


# 5fdf9c3c 01-Apr-2013 Jim Harris <jimharris@FreeBSD.org>

Add unmapped bio support to nvme(4) and nvd(4).

Sponsored by: Intel


# 1e526bc4 29-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or
NULL. This simplifies decisions around if/how requests are routed through
busdma. It also paves the way for supporting unmapped bios.

Sponsored by: Intel


# bdd1fd40 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Fix printf format issue on i386.

Reported by: bz


# 547d523e 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Clean up debug prints.

1) Consistently use device_printf.
2) Make dump_completion and dump_command into something more
human-readable.

Sponsored by: Intel
Reviewed by: carl


# 237d2019 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Change a number of malloc(9) calls to use M_WAITOK instead of
M_NOWAIT.

Sponsored by: Intel
Suggested by: carl
Reviewed by: carl


# 43a37256 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Abort and do not retry any outstanding admin commands left over after
a controller reset.

Sponsored by: Intel
Reviewed by: carl


# 232e2edb 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add the ability to internally mark a controller as failed, if it is unable to
start or reset. Also add a notifier for NVMe consumers for controller fail
conditions and plumb this notifier for nvd(4) to destroy the associated
GEOM disks when a failure occurs.

This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state. To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.

Sponsored by: Intel
Reviewed by: carl


# 3d7eb41c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Just disable the controller instead of deleting IO queues during detach.

This is just as effective, and removes the need for a bunch of admin commands
to a controller that's going to be disabled shortly anyways.

Sponsored by: Intel
Reviewed by: carl


# cb5b7c13 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Cap the number of retry attempts to a configurable number. This ensures
that if a specific I/O repeatedly times out, we don't retry it indefinitely.

The default number of retries will be 4, but is adjusted using hw.nvme.retry_count.

Sponsored by: Intel
Reviewed by: carl


# cf81529c 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Create struct nvme_status.

NVMe error log entries include status, so breaking this out into
its own data structure allows it to be included in both the
nvme_completion data structure as well as error log entry data
structures.

While here, expose nvme_completion_is_error(), and change all of
the places that were explicitly looking at sc/sct bits to use this
macro instead.

Sponsored by: Intel
Reviewed by: carl


# f37c22a3 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Make nvme_ctrlr_reset a nop if a reset is already in progress.

This protects against cases where a controller crashes with multiple
I/O outstanding, each timing out and requesting controller resets
simultaneously.

While here, remove a debugging printf from a previous commit, and add
more logging around I/O that need to be resubmitted after a controller
reset.

Sponsored by: Intel
Reviewed by: carl


# 48ce3178 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

By default, always escalate to controller reset when an I/O times out.

While aborts are typically cleaner than a full controller reset, many times
an I/O timeout indicates other controller-level issues where aborts may not
work. NVMe drivers for other operating systems are also defaulting to
controller reset rather than aborts for timed out I/O.

Sponsored by: Intel
Reviewed by: carl


# 94143332 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add a tunable for the I/O timeout interval. Default is still 30 seconds,
but can be adjusted between a min/max of 5 and 120 seconds.

Sponsored by: Intel
Reviewed by: carl


# 12d191ec 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add handling for controller fatal status (csts.cfs).

On any I/O timeout, check for csts.cfs==1. If set, the controller
is reporting fatal status and we reset the controller immediately,
rather than trying to abort the timed out command.

This changeset also includes deferring the controller start portion
of the reset to a separate task. This ensures we are always performing
a controller start operation from a consistent context.

Sponsored by: Intel
Reviewed by: carl


# b846efd7 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add controller reset capability to nvme(4) and ability to explicitly
invoke it from nvmecontrol(8).

Controller reset will be performed in cases where I/O are repeatedly
timing out, the controller reports an unrecoverable condition, or
when explicitly requested via IOCTL or an nvme consumer. Since the
controller may be in such a state where it cannot even process queue
deletion requests, we will perform a controller reset without trying
to clean up anything on the controller first.

Sponsored by: Intel
Reviewed by: carl


# 65c2474e 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Keep a doubly-linked list of outstanding trackers.

This enables in-order re-submission of I/O after a controller reset.

Sponsored by: Intel


# 0a0b08cc 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Enable asynchronous event requests on non-Chatham devices.

Also add logic to clean up all outstanding asynchronous event requests
when resetting or shutting down the controller, since these requests
will not be explicitly completed by the controller itself.

Sponsored by: Intel


# 274b3a88 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Specify command timeout interval on a per-command type basis.

This is primarily driven by the need to disable timeouts for asynchronous
event requests, which by nature should not be timed out.

Sponsored by: Intel


# 879de699 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Explicitly abort a timed out command, if the ABORT command sent to the
controller indicates the command was not found.

Sponsored by: Intel


# 6cb06070 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Break out the code for completing an nvme_tracker object into a separate
function.

This allows for completions outside the normal completion path, for example
when an ABORT command fails due to the controller reporting the targeted
command does not exist. This is mainly for protection against a faulty
controller, but we need to clean up our internal request nonetheless.

Sponsored by: Intel


# 448195e7 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add support for ABORT commands, including issuing these commands when
an I/O times out.

Also ensure that we retry commands that are aborted due to a timeout.

Sponsored by: Intel


# d6f54866 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add an internal _nvme_qpair_submit_request function, which performs
the submit action assuming the qpair lock has already been acquired.

Also change nvme_qpair_submit_request to just lock/unlock the mutex
around a call to this new function.

This fixes a recursive mutex acquisition in the retry path.

Sponsored by: Intel


# 633c5729 31-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Use callout_reset_curcpu to allow the callout to be handled by the
current CPU and not always CPU 0.

This has the added benefit of reducing a huge amount of spinlock
contention on the callout_cpu spinlock for CPU 0.

Sponsored by: Intel


# 9427a0fe 18-Oct-2012 Gleb Smirnoff <glebius@FreeBSD.org>

Fix build after r241659.


# 0f71ecf7 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add ability to queue nvme_request objects if no nvme_trackers are available.

This eliminates the need to manage queue depth at the nvd(4) level for
Chatham prototype board workarounds, and also adds the ability to
accept a number of requests on a single qpair that is much larger
than the number of trackers allocated.

Sponsored by: Intel


# 21b6da58 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Preallocate a limited number of nvme_tracker objects per qpair, rather
than dynamically creating them at runtime.

Sponsored by: Intel


# 5ae9ed68 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Create nvme_qpair_submit_request() which eliminates all of the code
duplication between the admin and io controller-level submit
functions.

Sponsored by: Intel


# c2e83b40 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Simplify how the qpair lock is acquired and released.

Sponsored by: Intel


# 5fa5cc5f 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Cleanup uio-related code to use struct nvme_request and
nvme_ctrlr_submit_io_request().

While here, also fix case where a uio may have more than 1 iovec.
NVMe's definition of SGEs (called PRPs) only allows for the first SGE to
start on a non-page boundary. The simplest way to handle this is to
construct a temporary uio for each iovec, and submit an NVMe request
for each.

Sponsored by: Intel


# d281e8fb 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add nvme_ctrlr_submit_[admin|io]_request functions which consolidates
code for allocating nvme_tracker objects and making calls into
bus_dmamap_load for commands which have payloads.

Sponsored by: Intel


# ad697276 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add struct nvme_request object which contains all of the parameters passed
from an NVMe consumer.

This allows us to mostly build NVMe command buffers without holding the
qpair lock, and also allows for future queueing of nvme_request objects
in cases where the submission queue is full and no nvme_tracker objects
are available.

Sponsored by: Intel


# f2b19f67 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Merge struct nvme_prp_list into struct nvme_tracker.

This simplifies the driver significantly where it is constructing
commands to be submitted to hardware. By reducing the number of
PRPs (NVMe parlance for SGE) from 128 to 32, it ensures we do not
allocate too much memory for more common smaller I/O sizes, while
still supporting up to 128KB I/O sizes.

This also paves the way for pre-allocation of nvme_tracker objects
for each queue which will simplify the I/O path even further.

Sponsored by: Intel


# 9eb93f29 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Add return codes to all functions used for submitting commands to I/O
queues.

Sponsored by: Intel


# 6568ebfc 10-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Count number of times each queue pair's interrupt handler is invoked.

Also add sysctls to query and reset each queue pair's stats, including
the new count added here.

Sponsored by: Intel


# bb0ec6b3 17-Sep-2012 Jim Harris <jimharris@FreeBSD.org>

This is the first of several commits which will add NVM Express (NVMe)
support to FreeBSD. A full description of the overall functionality
being added is below. nvmexpress.org defines NVM Express as "an optimized
register interface, command set and feature set fo PCI Express (PCIe)-based
Solid-State Drives (SSDs)."

This commit adds nvme(4) and nvd(4) driver source code and Makefiles
to the tree.

Full NVMe functionality description:
Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe)
device support.

There will continue to be ongoing work on NVM Express support, but there
is more than enough to allow for evaluation of pre-production NVM Express
devices as well as soliciting feedback. Questions and feedback are welcome.

nvme(4) implements NVMe hardware abstraction and is a provider of NVMe
namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN.
nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks.
nvmecontrol(8) is used for NVMe configuration and management.

The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue
dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeoutsee below)
- compilation switches for support back to stable-7

nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling

nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in
human-readable or hex format
- perftest: quick and dirty performance test to measure raw
performance of NVMe device without userspace/physio/GEOM
overhead

The following are still work in progress and will be completed over the
next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval

This has been primarily tested on amd64, with light testing on i386. I
would be happy to provide assistance to anyone interested in porting
this to other architectures, but am not currently planning to do this
work myself. Big-endian and dmamap sync for command/completion queues
are the main areas that would need to be addressed.

The nvme(4) driver currently has references to Chatham, which is an
Intel-developed prototype board which is not fully spec compliant.
These references will all be removed over time.

Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>