History log of /freebsd-current/sys/dev/nvme/nvme_qpair.c
Revision	Date	Author	Comments
# 0dd84c3b	13-May-2024	Warner Losh <imp@FreeBSD.org>	nvme: Add comment about where tr->deadline is set It's easy to overlook the chain of events that lead to tr->deadline being updated. Add a comment here to explain what otherwise looks like an oversight w/o careful study. Sponsored by: Netflix
# c931cf6a	13-May-2024	Warner Losh <imp@FreeBSD.org>	nvme: Slight simplification We don't need to dereference qpair to get the ctrlr pointer each time, so use the cached value. It's not going to change. No change intended. Sponsored by: Netflix
# 9db8ca92	13-May-2024	Warner Losh <imp@FreeBSD.org>	nvme: Slight reworking this loop to match FreeBSD style Update the comment for the code, and slightly rework the code in the 'fast exit' paradigm that FreeBSD generally tries to do. Sponsored by: Netflix
# 5a178b83	13-May-2024	Warner Losh <imp@FreeBSD.org>	nvme: Add locking asserts nvme_qpair_complete_tracker and nvme_qpair_manual_complete_tracker have to be called without the qpair lock, so assert it's unowned. Sponsored by: Netflix
# 5650bd3f	29-Jan-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Use the NVMEF macro to construct fields Reviewed by: chuck, imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43605
# 479680f2	29-Jan-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Use the NVMEV macro instead of expanded versions Reviewed by: chuck Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43595
# fdafd315	24-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Automated cleanup of cdefs and other formatting Apply the following automated changes to try to eliminate no-longer-needed sys/cdefs.h includes as well as now-empty blank lines in a row. Remove /^#if.\n#endif.\n#include\s+<sys/cdefs.h>.\n/ Remove /\n+#include\s+<sys/cdefs.h>.\n+#if.\n#endif.\n+/ Remove /\n+#if.\n#endif.\n+/ Remove /^#if.\n#endif.\n/ Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/ Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/ Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/ Sponsored by: Netflix
# 8d6c0743	06-Nov-2023	Alexander Motin <mav@FreeBSD.org>	nvme: Introduce longer timeouts for admin queue KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty namespace. In some cases like hot-plug it takes longer, triggering timeout and controller resets after just 30 seconds. Linux for many years has separate 60 seconds timeout for admin queue. This patch does the same. And it is good to be consistent. Sponsored by: iXsystems, Inc. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D42454
# afc3d49b	10-Oct-2023	Warner Losh <imp@FreeBSD.org>	nvme: Close a race in destroying qpair and timeouts While we should have cleared all the pending I/O prior to calling nvme_qpair_destroy, which should ensure that if the callout_drain causes a call to nvme_qpair_timeout(), it won't schedule any new timeout. However, it doesn't hurt to set timeout_pending to false in nvme_qpair_destroy() and have nvme_qpair_timeout() exit early if it sees it w/o scheduling a timeout. Since we don't otherwise stop the timeout until we're about to destroy the qpair, this ensures we fail safe. The lock/unlock also ensures the callout_drain will either remove the callout, or wait for it to run with the early bailout. We can likely further improve this by using callout_stop() inside the pending lock. I'll investigate that for future refinement. Sponsored by: Netflix Suggestions by: jhb Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D42065
# 9cd7b624	10-Oct-2023	Warner Losh <imp@FreeBSD.org>	nvme: Eliminate RECOVERY_FAILED state While it seemed like a good idea to have this state, we can do everything we wanted with the state by checking ctrlr->is_failed since that's set before we start failing the qpairs. Add some comments about racing when we're failing the controller, though in practice I'm not sure that kind of race could even be lost. Sponsored by: Netflix Reviewed by: chuck, gallatin, jhb Differential Revision: https://reviews.freebsd.org/D42051
# 1d6021cd	25-Sep-2023	Warner Losh <imp@FreeBSD.org>	nvme: Suppress noise messages When we're suspending, we get messages about waiting for the controller to reset. These are in error: we're not waiting for it to reset. We put the recovery state as part of suspending, so we should suppress these as a false positive. Also remove a stray debug that's left over from earlier versions of the recovery code that no longer makes sense. Sponsored by: Netflix
# da8324a9	24-Sep-2023	Warner Losh <imp@FreeBSD.org>	nvme: Fix locking protocol violation to fix suspend / resume Currently, when we suspend, we need to tear down all the qpairs. We call nvme_admin_qpair_abort_aers with the admin qpair lock held, but the tracker it will call for the pending AER also locks it (recursively) hitting an assert. This routine is called without the qpair lock held when we destroy the device entirely in a number of places. Add an assert to this effect and drop the qpair lock before calling it. nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the list, dropping it around calls to nvme_qpair_complete_tracker, and restarting the list scan after picking it back up. Note: If interrupts are still running, there's a tiny window for these AERs: If one fires just an instant after we manually complete it, then we'll be fine: we set the state of the queue to 'waiting' and we ignore interrupts while 'waiting'. We know we'll destroy all the queue state with these pending interrupts before looking at them again and we know all the TRs will have been completed or rescheduled. So either way we're covered. Also, tidy up the failure case as well: failing a queue is a superset of disabling it, so no need to call disable first. This solves some locking issues with recursion since we don't need to recurse. Set the qpair state of failed queues to RECOVERY_FAILED and stop scheduling the watchdog. Assert we're not failed when we're enabling a qpair, since failure currently is one-way. Make failure a little less verbose. Next, kill the pre/post reset stuff. It's completely bogus since we disable the qpairs, we don't need to also hold the lock through the reset: disabling will cause the ISR to return early. This keeps us from recursing on the recovery lock when resuming. We only need the recovery lock to avoid a specific race between the timer and the ISR. Finally, kill NVME_RESET_2X. It's been a major release since we put it in and nobody has used it as far as I can tell. And it was a motivator for the pre/post uglification. These are all interrelated, so need to be done at the same time. Sponsored by: Netflix Reviewed by: jhb Tested by: jhb (made sure suspend / resume worked) MFC After: 3 days Differential Revision: https://reviews.freebsd.org/D41866
# d9543162	15-Sep-2023	Warner Losh <imp@FreeBSD.org>	nvme: Give up when we've failed Normally, we poll the device every so often to see if commands have timed out. However, we'll go into the recovery state as part of failing the drive. To account for all possibilities, if we're failed when we get into the polling function, just stop polling: Party is over. Sponsored by: Netflix
# 8052b01e	25-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Add exclusion for ISR Add a basically uncontended spinlock that we take out while the ISR is running. This has two effects: First, when we get a timeout, we can safely call the nvme_qpair_process_completions w/o racing any ISRs. Second, we can use it to ensure that we don't reset the card while the ISRs are active (right now we just sleep and hope for the best, which usually is fine, but not always). Sponsored by: Netflix MFC After: 2 weeks Reviewed by: chuck, gallatin Differential Revision: https://reviews.freebsd.org/D41452
# d4959bfc	25-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Greatly improve error recovery Next phase of error recovery: Eliminate the RECOVERY_START phase, since we don't need to wait to start recovery. Eliminate the RECOVERY_RESET phase since it is transient, we now transition from RECOVERY_NORMAL into RECOVERY_WAITING. In normal mode, read the status of the controller. If it is in failed state, or appears to be hot-plugged, jump directly to reset which will sort out the proper things to do. This will cause all pending I/O to complete with an abort status before the reset. When in the NORMAL state, call the interrupt handler. This will complete all pending transactions when interrupts are broken or temporarily misbehaving. We then check all the pending completions for timeouts. If we have abort enabled, then we'll send an abort. Otherwise we'll assume the controller is wedged and needs a reset. By calling the interrupt handler here, we'll avoid an issue with the current code where we transitioned to RECOVERY_START which prevented any completions from happening. Now completions happen. In addition, any follow-on I/O that is scheduled in the completion routines will be submitted, rather than queued, because the recovery state is correct. This also fixes a problem where I/O would time out, but never complete, leading to hung I/O. Resetting remains the same as before, just when we choose to reset has changed. A nice side effect of these changes is that we now do I/O when interrupts to the card are totally broken. Follow-on commits will improve the error reporting and logging when this happens. Performance will be awful, but will at least be minimally functional. There is a small race when we're checking the completions if interrupts are working, but this is handled in a future commit. Sponsored by: Netflix MFC After: 2 weeks Differential Revision: https://reviews.freebsd.org/D36922
# 2a6b7055	25-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Timeout expired transactions When we went to having a shared timeout routine, failing the timed-out transaction code was inadvertently dropped. Reinstate it. Fixes: 502dc84a8b670 Sponsored by: Netflix MFC After: 2 weeks Reviewed by: chuck, jhb Differential Revision: https://reviews.freebsd.org/D36921
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 2ad9a815	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Directly lookup op code Rather than have a table to walk through, use a sparse array. Suggested by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D41353
# 95cd10f1	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Add comments about other fields in status When manually completing an I/O, we do so because we have no status back from the card. Note M, CRD and P are all 0 because this is an artificial event (and phase isn't checked when it's completed this way). There's no MORE information in the error log page and there's no delayed retry (CRD=0) and we don't currently request CRD to be set to anything other than 0 and thus don't implement delayed retry. Sponsored by: Netflix Reviewed by: chuck, mav, jhb Differential Revision: https://reviews.freebsd.org/D41314
# a510dbc8	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Be less verbose when cancelling I/O or admin commands When we're resetting, and there's outstanding I/O that we're cancelling, only report we're cancelling the I/O once rather than once per I/O. Likewise when we reschedule the I/O. We don't need to say for each one that we're cancelling/rescheduling something, and then report the I/O that we're doing. Likewise with cancelling admin commands (we never retry them here, so a similar change isn't needed). Sponsored by: Netflix Reviewed by: chuck, mav Differential Revision: https://reviews.freebsd.org/D41313
# ac8c866f	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Add more NVME Base Spec 2.0 and NVME Command Set Spec 1.0a Add admin commands capacity management, lockdown and fabrics commands. Add I/O copy command. Sponsored by: Netflix Reviewed by: chuck, mav, jhb Differential Revision: https://reviews.freebsd.org/D41311
# edd23e4d	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Eliminate redundant code get_admin_opcode_string and get_io_opcode_string are identical, but start with different tables. Use a helper routine that takes an argument to implement these instead. A future commit will refine this further. Sponsored by: Netflix Reviewed by: chuck, mav, jhb Differential Revision: https://reviews.freebsd.org/D41310
# 7be0b068	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Remove duplicate command printing routine Both nvme_dump_command and nvme_qpair_print_command print nvme commands. The latter is better. Recode the one call to nvme_dump_command to use nvme_qpair_print_command and delete the former. No sense having two nearly identical routines. A future commit will convert to sbuf. Sponsored by: Netflix Reviewed by: chuck, mav, jhb Differential Revision: https://reviews.freebsd.org/D41309
# 6f76d493	07-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Remove duplicate completion printing routine Both nvme_dump_completion and nvme_qpair_print_completion print completions. The latter is better. Recode the two instances of nvme_dump_completion to use nvme_qpair_print_completion and delete the former. No sense having two nearly identical routines. A future commit will convert this to sbuf. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D41308
# 92103adb	24-Jul-2023	John Baldwin <jhb@FreeBSD.org>	nvme: Use a memdesc for the request buffer instead of a bespoke union. This avoids encoding CAM-specific knowledge in nvme_qpair.c. Reviewed by: chuck, imp, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D41119
# 5ae44634	27-Jun-2023	John Baldwin <jhb@FreeBSD.org>	nvme: Fix typo in "Command Aborted by Host" constant name. Reviewed by: chuck, imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D40763
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# 49ebbdb2	08-Mar-2023	Alexander Motin <mav@FreeBSD.org>	Add NAMESPACE MANAGEMENT into admin_opcode[]. MFC after: 1 week
# 4982884b	11-Oct-2022	Warner Losh <imp@FreeBSD.org>	nvme: Always set deadline to max When a transaction is on the outstanding list, it needs to have a valid timeout value, so set it to infinity before placing it on the list. Place before we put it on the list, even though the list is protected by the qpair lock. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D36920
# a69c0964	05-Aug-2022	Alexander Motin <mav@FreeBSD.org>	nvme: Print CRD, M and DNR status bits on errors. It may help with some issues debugging. MFC after: 1 week
# 0fd4cd40	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Use controller's page size instead of PAGE_SIZE to create qpair When constructing qpair, use the controller's notion of page size rather than the host's PAGE_SIZE. Currently, these are both 4k, but the arm 16k page size support requires decoupling. There's a "hidden" PAGE_SIZE in btoc, so we must change btoc(x) to howmany(x, ctrlr->page_size) to properly count the number of pages (in the drive's world view) that are needed for various calculations. With these changes, the nvme driver operates at production level load for both host 4k and host 16k page size. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34873
# dfa01f4f	08-Apr-2022	Gordon Bergling <gbe@FreeBSD.org>	nvme(4): Fix a typo in a source code comment - s/is is/is/ MFC after: 3 days
# b3c9b606	06-Jan-2022	Alexander Motin <mav@FreeBSD.org>	nvme: Do not rearm timeout for commands without one. Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding. They have no timeouts, but their presence in qpair->outstanding_tr caused useless timeout callout rearming twice a second. While there, relax timeout callout period from 0.5s to 0.5-1s to improve aggregation. Command timeouts are measured in seconds, so we don't need to be precise here. Reviewed by: imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D33781
# 2ec165e3	14-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Reduce traffic to the doorbell register Reduce traffic to doorbell register when processing multiple completion events at once. Only write it at the end of the loop after we've processed everything (assuming we found at least one completion, even if that completion wasn't valid). Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D32470
# 18dc12bf	12-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Restore hotplug warning Restore hotplug warning in recovery state machine. No functional change other than what message gets printed. Sponsored by: Netflix
# 36a87d0c	28-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: Sanity check completion id Make sure the completion ID is in the range of [0..num_trackers) since the values past the end of the act_tr array are never going to be valid trackers and will lead to pain and suffering if we try to dereference them to get the tracker or to set the tracker back to NULL as we complete the I/O. Sponsored by: Netflix Reviewed by: mav, chs, chuck Differential Revision: https://reviews.freebsd.org/D32088
# 587aa255	28-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: count number of ignored interrupts Count the number of times we're asked to process completions, but that we ignore because the state of the qpair isn't in RECOVERY_NONE. Sponsored by: Netflix Reviewed by: mav, chuck Differential Revision: https://reviews.freebsd.org/D32212
# 7d5eebe0	28-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: Add sanity check for phase on startup. The proper phase for the qpair right after reset in the first interrupt is 1. For it, make sure that we're not still in phase 0. This is an illegal state to be processing interrupts and indicates that we've failed to properly protect against a race between initializing our state and processing interrupts. Modify stat resetting code so it resets the number of interrupts to 1 instead of 0 so we don't trigger a false positive panic. Sponsored by: Netflix Reviewed by: cperciva, mav (prior version) Differential Revision: https://reviews.freebsd.org/D32211
# fa81f373	28-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: start qpair in state RECOVERY_WAITING An interrupt happens on the admin queue right away after the reset, so as soon as we enable interrupts, we'll get a call to our interrupt handler. It is safe to ignore this interrupt if we're not yet initialized, or to process it if we are. If we are initialized, we'll see there's no completion records and return. If we're not, we'll process no completion records and return. Either way, nothing is processed and nothing is lost. Until we've completely setup the qpair, we need to avoid processing completion records. Start the qpair in the waiting recovery state so we return immediately when we try to process completions. The code already sets it to 'NONE' when initialization is complete. It's safe to defer completion processing here because we don't send any commands before the initialization of the software state of the qpair is complete. And even if we were to somehow send a command prior to that completing, the completion record for that command would be processed when we send commands to the admin qpair after we've setup the software state. There's no good central point to add an assert for this last condition. This fixes a KASSERT "received completion for unknown cmd" panic on boot. Fixes: 502dc84a8b6703e7c0626739179a3cdffdd22d81 Sponsored by: Netflix Reviewed by: mav, cperciva, gallatin Differential Revision: https://reviews.freebsd.org/D32210
# 502dc84a	23-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: Use shared timeout rather than timeout per transaction Keep track of the approximate time commands are 'due' and the next deadline for a command. Twice a second, wake up to see if any commands have entered timeout. If so, quiesce and then enter a recovery mode half the timeout further in the future to allow the ISR to complete. Once we exit recovery mode, we go back to operations as normal. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28583
# 4b977e6d	17-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme/nda: Fail all nvme I/Os after controller fails Once the controller has failed, fail all I/O w/o sending it to the device. The rest of the nvme driver won't schedule any I/O to the failed device, and the controller is in an indeterminate state and can't accept I/O. Fail both at the top end of the sim and the bottom end. Don't bother queueing up the I/O for failure in a different task. Reviewed by: chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31341
# e3bdf3da	31-Aug-2021	Alexander Motin <mav@FreeBSD.org>	nvme(4): Add MSI and single MSI-X support. If we can't allocate more MSI-X vectors, accept using single shared. If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but accept single shared. If still no luck, fall back to shared INTx. This provides maximal flexibility in some limited scenarios. For example, vmd(4) does not support INTx and can handle only limited number of MSI/MSI-X vectors without sharing. MFC after: 1 week
# fc9a0840	15-Jul-2021	Warner Losh <imp@FreeBSD.org>	nvme: Enable interrupts after qpair fully constructed To guard against the ill effects of a spurious interrupt during construction (or one that was bogusly pending), enable interrupts after the qpair is completely constructed. Otherwise, we can die with null pointer dereferences in nvme_qpair_process_completions. This has been observed in at least one pre-release NVMe drive where the MSIX interrupt fired while the queue was being created, before we'd started the NVMe controller card. The alternative of only turning on the interrupts after the rest was tried, but was insufficient to work around this bug and made the code more complicated w/o benefit. Reviewed by: mav, chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31182
# aa0ab681	02-Jul-2021	Warner Losh <imp@FreeBSD.org>	nvme: coherently read status of completion records Coherently read the phase bit of the status completion record. We loop over the completion record array, looking for all the transactions in the same phase that have been completed. In doing that, we have to be careful to read the status field first, and if it indicates a complete record, we need to read and process that record. Otherwise, the host might be overtaken by device when reading this completion record, leading to a mistaken belief that the record is in phase. This leads to the code using old values and looking at an already completed entry, which has no current tracker. To work around this problem, we read the status and make sure it is in phase, we then re-read the entire completion record guaranteeing it's complete, valid, and consistent. In addition we resync the dmatag to reflect changes since the prior loop for the bouncing dma case. Reviewed by: jrtc27@, chuck@ Found by: jrtc27 (this fix is based in part on her D30995 fix) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31002
# 9600aa31	08-Feb-2021	Warner Losh <imp@FreeBSD.org>	nvme: use NVME_GONE rather than hard-coded 0xffffffff Make it clearer that the value 0xffffffff is being used to detect the device is gone. We use it other places in the driver for other meanings.
# 082905ca	04-Dec-2020	Warner Losh <imp@FreeBSD.org>	nvme: Remove a wmb() that's not necessary. bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can be DMA'd immediately after it returns. The details differ, but this mirrors atomic thread release semantics, at least for the buffers synced. For non-x86 platforms, bus_dmamap_sync() has the right syncing and fences. So in the past, wmb() had been omitted for them. For x86 platforms, the memory ordering is already strong enough to ensure DMA to the device sees the current contents. As such, we don't need the wmb() here. It translates to an sfence which is only needed for writes to regions that have the write combining attribute set or when some exotic opcodes are used. The nvme driver does neither of these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we can be assured any optimizer won't reorder the bus_dmamap_sync and the bus_space_write operations. The wmb() was a vestige of the pre-busdma version initially committed to the tree. Reviewed by: kib@, gallatin@, chuck@, mav@ Differential Revision: https://reviews.freebsd.org/D27448
# 8f9d5a8d	02-Dec-2020	Michal Meloun <mmel@FreeBSD.org>	NVME: Multiple busdma related fixes. - in nvme_qpair_process_completions() do dma sync before completion buffer is used. - in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure that all CPU stores are visible to external (including DMA) observers. - Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems, buffers continuously owned (and accessed) by DMA must be allocated with this flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems (or coherent buses in mixed systems). MFC after: 4 weeks Reviewed by: mav, imp Differential Revision: https://reviews.freebsd.org/D27446
# 8d08cdc7	02-Dec-2020	Chuck Tuffli <chuck@FreeBSD.org>	nvme: Fix typo in definition Change occurrences of "selt test" to "self test" in the NVMe header file. Reviewed by: imp, mav MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27439
# ac90f70d	28-Nov-2020	Alexander Motin <mav@FreeBSD.org>	Increase nvme(4) maximum transfer size from 1MB to 2MB. With 4KB page size the 2MB is the maximum we can address with one page PRP. Going further would require chaining, that would add some more complexity. On the other side, to reduce memory consumption, allocate the PRP memory respecting maximum transfer size reported in the controller identify data. Many of NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Reviewed by: imp MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# d87b31e1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	nvme: clean up empty lines in .c and .h files
# 96ad26ee	04-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Remove free_domain() and uma_zfree_domain(). These functions were introduced before UMA started ensuring that freed memory gets placed in domain-local caches. They no longer serve any purpose since UMA now provides their functionality by default. Remove them to simplify the kernel memory allocator interfaces a bit. Reviewed by: cem, kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25937
# ead7e103	18-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Make polled request timeout less invasive. Instead of panic after one second of polling, make the normal timeout handler to activate, reset the controller and abort the outstanding requests. If all of it won't happen within 10 seconds then something in the driver is likely stuck bad and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. MFC after: 1 week Sponsored by: iXsystems, Inc.
# 550d5d64	17-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Fix admin qpair leak if detached during initial reset. MFC after: 1 week Sponsored by: iXsystems, Inc.
# 4053f8ac	02-May-2020	David Bright <dab@FreeBSD.org>	Fix various Coverity-detected errors in nvme driver This fixes several Coverity-detected errors in the nvme driver. CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975, 1403980 Reviewed by: imp@, vangyzen@ MFC after: 5 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24532
# aeb665b5	30-Mar-2020	Ed Maste <emaste@FreeBSD.org>	remove extraneous double ;s in sys/
# 0a4b14e8	15-Dec-2019	Michal Meloun <mmel@FreeBSD.org>	Properly synchronize completion DMA buffers. Within command completion processing the callback function may access DMAed data buffer. Synchronize it before use, not after. This allows to use NVMe disk on non-DMA coherent arm64 system. MFC after: 3 weeks
# 7588c6cc	13-Dec-2019	Warner Losh <imp@FreeBSD.org>	Move to using bool instead of boolean_t While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999
# 43393e8b	06-Dec-2019	Warner Losh <imp@FreeBSD.org>	trackers always know what qpair they are on Don't needlessly pass around qpair pointers when the tracker knows what qpair it's on. This will simplify code and make it easier to split submission and completion queues in the future. Signed-off-by: John Meneghini <johnm@netapp.com>
# 1eab19cb	23-Sep-2019	Alexander Motin <mav@FreeBSD.org>	Make nvme(4) driver some more NUMA aware. - For each queue pair precalculate CPU and domain it is bound to. If queue pairs are not per-CPU, then use the domain of the device. - Allocate most of queue pair memory from the domain it is bound to. - Bind callouts to the same CPUs as queue pair to avoid migrations. - Do not assign queue pairs to each SMT thread. It just wasted resources and increased lock congestions. - Remove fixed multiplier of CPUs per queue pair, spread them even. This allows to use more queue pairs in some hardware configurations. - If queue pair serves multiple CPUs, bind different NVMe devices to different CPUs. MFC after: 1 month Sponsored by: iXsystems, Inc.
# f93b7f95	04-Sep-2019	Warner Losh <imp@FreeBSD.org>	Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514
# 71a28181	21-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Improve NVMe hot unplug handling. If device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it triggers a controller reset, do not wait for the impossible and quickly report failure. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# a6d222eb	02-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Add more random bits from NVMe 1.4. MFC after: 2 weeks
# 90dfa8f0	01-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Add more new fields and values from NVMe 1.4. MFC after: 2 weeks
# 5e83c2ff	19-Jul-2019	Warner Losh <imp@FreeBSD.org>	Keep track of the number of commands that exhaust their retry limit. While we print failure messages on the console, sometimes logs are lost or overwhelmed. Keeping a count of how many times we've failed retriable commands helps get a magnitude of the problem.
# c37fc318	19-Jul-2019	Warner Losh <imp@FreeBSD.org>	Keep track of the number of retried commands. Retried commands can indicate a performance degradation of an nvme drive. Keep track of the number of retries and report it out via sysctl, just like number of commands and interrupts.
# c75bdc04	18-Jul-2019	Warner Losh <imp@FreeBSD.org>	Provide new tunable hw.nvme.verbose_cmd_dump The nvme driver dumps only the most relevant details about a command when it fails. However, there are times this is not sufficient (such as debugging weird issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1 in loader.conf will enable more complete debugging information about each command that fails. Reviewed by: rpokala Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20988
# d0aaeffd	01-Jun-2019	Warner Losh <imp@FreeBSD.org>	Since a fatal trap can happen at arbitrary times, don't panic when the completions are not in a consistent state. Cope with the different places the normal I/O completion polling thread can be interrupted and then re-entered during a kernel panic + dump. Reviewed by: jhb and markj (both prior versions) Differential Revision: https://reviews.freebsd.org/D20478
# 2ffd6fce	08-Mar-2019	Warner Losh <imp@FreeBSD.org>	Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431
# 95108cad	02-Mar-2019	Warner Losh <imp@FreeBSD.org>	Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this)
# 45d7e233	27-Feb-2019	Warner Losh <imp@FreeBSD.org>	Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.
# d706306d	27-Feb-2019	Warner Losh <imp@FreeBSD.org>	Remove #ifdef code to support FreeBSD versions that haven't been supported in years. A number of changes have been made to the driver that likely wouldn't work on those older versions that aren't properly ifdef'd and it's project policy to GC such code once it is stale.
# a6461357	26-Dec-2018	Alexander Motin <mav@FreeBSD.org>	Add descriptions to NVMe interrupts. MFC after: 1 month
# 9544e6dc	21-Aug-2018	Chuck Tuffli <chuck@FreeBSD.org>	Make NVMe compatible with the original API The original NVMe API used bit-fields to represent fields in data structures defined by the specification (e.g. the op-code in the command data structure). The implementation targeted x86_64 processors and defined the bit fields for little endian dwords (i.e. 32 bits). This approach does not work as-is for big endian architectures and was changed to use a combination of bit shifts and masks to support PowerPC. Unfortunately, this changed the NVMe API and forces #ifdef's based on the OS revision level in user space code. This change reverts to something that looks like the original API, but it uses bytes instead of bit-fields inside the packed command structure. As a bonus, this works as-is for both big and little endian CPU architectures. Bump __FreeBSD_version to 1200081 due to API change Reviewed by: imp, kbowling, smh, mav Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D16404
# 2e0090af	03-Aug-2018	Justin Hibbits <jhibbits@FreeBSD.org>	nvme(4): Add bus_dmamap_sync() at the end of the request path Summary: Some architectures, in this case powerpc64, need explicit synchronization barriers vs device accesses. Prior to this change, when running 'make buildworld -j72' on a 18-core (72-thread) POWER9, I would see controller resets often. With this change, I don't see these resets messages, though another tester still does, for yet to be determined reasons, so this may not be a complete fix. Additionally, I see a ~5-10% speed up in buildworld times, likely due to not needing to reset the controller. Reviewed By: jimharris Differential Revision: https://reviews.freebsd.org/D16570
# c6c70c07	30-Apr-2018	Alexander Motin <mav@FreeBSD.org>	Fix use-after-free in nvme_qpair_destroy(). dma_tag_payload should not be destroyed before payload_dma_map, and seems it should be used there instead of dma_tag to match creation. Sponsored by: iXsystems, Inc.
# d85d9648	15-Mar-2018	Warner Losh <imp@FreeBSD.org>	Try polling the qpairs on timeout. On some systems, we're getting timeouts when we use multiple queues on drives that work perfectly well on other systems. On a hunch, Jim Harris suggested I poll the completion queue when we get a timeout. This patch polls the completion queue if no fatal status was indicated. If it had pending I/O, we complete that request and return. Otherwise, if aborts are enabled and no fatal status, we abort the command and return. Otherwise we reset the card. This may clear up the problem, or we may see it result in lots of timeouts and a performance problem. Either way, we'll know the next step. We may also need to pay attention to the fatal status bit of the controller. PR: 211713 Suggested by: Jim Harris Sponsored by: Netflix
# 6b1a96b1	10-Mar-2018	Alexander Motin <mav@FreeBSD.org>	Add new opcodes and statuses from NVMe 1.3a. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# 0d787e9b	22-Feb-2018	Wojciech Macek <wma@FreeBSD.org>	NVMe: Add big-endian support Remove bitfields from defined structures as they are not portable. Instead use shift and mask macros in the driver and nvmecontrol application. NVMe is now working on powerpc64 host. Submitted by: Michal Stanek <mst@semihalf.com> Obtained from: Semihalf Reviewed by: imp, wma Sponsored by: IBM, QCM Technologies Differential revision: https://reviews.freebsd.org/D13916
# 718cf2cc	27-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/dev: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
# 51977281	29-Aug-2017	Warner Losh <imp@FreeBSD.org>	Add CAM/NVMe support for CAM_DATA_SG This adds support in pass(4) for data to be described with a scatter-gather list (sglist) to augment the existing (single) virtual address. Differential Revision: https://reviews.freebsd.org/D11361 Submitted by: Chuck Tuffli Reviewed by: imp@, scottl@, kenm@
# 824073fb	07-Mar-2017	Warner Losh <imp@FreeBSD.org>	Avoid dereferencing uninitialized elements in the error path. Some drives sometimes have errors for things like setting the number of queue entries in the submission queue. The error paths taken for these drives ensure a panic dereferencing uninitialized data. Sponsored by: Netflix
# a965389b	07-Nov-2016	Scott Long <scottl@FreeBSD.org>	Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a bunch of safety belts and error handling in related codepaths. Reviewed by: jimharris Obtained from: Netflix Differential Revision: D8453
# e5af5854	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: do not pre-allocate MSI-X IRQ resources The issue referenced here was resolved by other changes in recent commits, so this code is no longer needed. MFC after: 3 days Sponsored by: Intel
# 3345ed9a	08-Apr-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: use BUS_SPACE_MAXSIZE for bus_dma_tag_create maxsize parameter This fixes i386 PAE build fallout from r281281. Reported by: bz MFC after: 1 week
# 36b0e4ee	08-Apr-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: remove CHATHAM related code Chatham was an internal NVMe prototype board used for early driver development. MFC after: 1 week Sponsored by: Intel
# a6e30963	08-Apr-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: create separate DMA tag for non-payload DMA buffers Submission and completion queue memory need to use a separate DMA tag for mappings than payload buffers, to ensure mappings remain contiguous even with DMAR enabled. Submitted by: kib MFC after: 1 week Sponsored by: Intel
# f42ca756	18-Mar-2014	Jim Harris <jimharris@FreeBSD.org>	nvme: Allocate all MSI resources up front so that we can fall back to INTx if necessary. Sponsored by: Intel MFC after: 3 days
# 1416ef36	17-Mar-2014	Jim Harris <jimharris@FreeBSD.org>	nvme: NVMe specification dictates 4-byte alignment for PRPs (not 8). Sponsored by: Intel MFC after: 3 days
# e9efbc13	09-Jul-2013	Jim Harris <jimharris@FreeBSD.org>	Update copyright dates. MFC after: 3 days
# bbd412dd	26-Jun-2013	Jim Harris <jimharris@FreeBSD.org>	Remove remaining uio-related code. The nvme_physio() function was removed quite a while ago, which was the only user of this uio-related code. Sponsored by: Intel MFC after: 3 days
# 7b68ae1e	26-Jun-2013	Jim Harris <jimharris@FreeBSD.org>	Fail any passthrough command whose transfer size exceeds the controller's max transfer size. This guards against rogue commands coming in from userspace. Also add KASSERTS for the virtual address and unmapped bio cases, if the transfer size exceeds the controller's max transfer size. Sponsored by: Intel MFC after: 3 days
# 8d09e3c4	26-Jun-2013	Jim Harris <jimharris@FreeBSD.org>	Use MAXPHYS to specify the maximum I/O size for nvme(4). Also allow admin commands to transfer up to this maximum I/O size, rather than the artificial limit previously imposed. The larger I/O size is very beneficial for upcoming firmware download support. This has the added benefit of simplifying the code since both admin and I/O commands now use the same maximum I/O size. Sponsored by: Intel MFC after: 3 days
# ca269f32	12-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Move the busdma mapping functions to nvme_qpair.c. This removes nvme_uio.c completely. Sponsored by: Intel
# e2b99004	12-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Do not panic when a busdma mapping operation fails. Instead, print an error message and fail the associated command with DATA_TRANSFER_ERROR NVMe completion status. Sponsored by: Intel
# 5fdf9c3c	01-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Add unmapped bio support to nvme(4) and nvd(4). Sponsored by: Intel
# 1e526bc4	29-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or NULL. This simplifies decisions around if/how requests are routed through busdma. It also paves the way for supporting unmapped bios. Sponsored by: Intel
# bdd1fd40	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Fix printf format issue on i386. Reported by: bz
# 547d523e	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Clean up debug prints. 1) Consistently use device_printf. 2) Make dump_completion and dump_command into something more human-readable. Sponsored by: Intel Reviewed by: carl
# 237d2019	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Change a number of malloc(9) calls to use M_WAITOK instead of M_NOWAIT. Sponsored by: Intel Suggested by: carl Reviewed by: carl
# 43a37256	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Abort and do not retry any outstanding admin commands left over after a controller reset. Sponsored by: Intel Reviewed by: carl
# 232e2edb	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add the ability to internally mark a controller as failed, if it is unable to start or reset. Also add a notifier for NVMe consumers for controller fail conditions and plumb this notifier for nvd(4) to destroy the associated GEOM disks when a failure occurs. This requires a bit of work to cover the races when a consumer is sending I/O requests to a controller that is transitioning to the failed state. To help cover this condition, add a task to defer completion of I/Os submitted to a failed controller, so that the consumer will still always receive its completions in a different context than the submission. Sponsored by: Intel Reviewed by: carl
# 3d7eb41c	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Just disable the controller instead of deleting IO queues during detach. This is just as effective, and removes the need for a bunch of admin commands to a controller that's going to be disabled shortly anyways. Sponsored by: Intel Reviewed by: carl
# cb5b7c13	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Cap the number of retry attempts to a configurable number. This ensures that if a specific I/O repeatedly times out, we don't retry it indefinitely. The default number of retries will be 4, but is adjusted using hw.nvme.retry_count. Sponsored by: Intel Reviewed by: carl
# cf81529c	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Create struct nvme_status. NVMe error log entries include status, so breaking this out into its own data structure allows it to be included in both the nvme_completion data structure as well as error log entry data structures. While here, expose nvme_completion_is_error(), and change all of the places that were explicitly looking at sc/sct bits to use this macro instead. Sponsored by: Intel Reviewed by: carl
# f37c22a3	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Make nvme_ctrlr_reset a nop if a reset is already in progress. This protects against cases where a controller crashes with multiple I/O outstanding, each timing out and requesting controller resets simultaneously. While here, remove a debugging printf from a previous commit, and add more logging around I/O that need to be resubmitted after a controller reset. Sponsored by: Intel Reviewed by: carl
# 48ce3178	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	By default, always escalate to controller reset when an I/O times out. While aborts are typically cleaner than a full controller reset, many times an I/O timeout indicates other controller-level issues where aborts may not work. NVMe drivers for other operating systems are also defaulting to controller reset rather than aborts for timed out I/O. Sponsored by: Intel Reviewed by: carl
# 94143332	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add a tunable for the I/O timeout interval. Default is still 30 seconds, but can be adjusted between a min/max of 5 and 120 seconds. Sponsored by: Intel Reviewed by: carl
# 12d191ec	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add handling for controller fatal status (csts.cfs). On any I/O timeout, check for csts.cfs==1. If set, the controller is reporting fatal status and we reset the controller immediately, rather than trying to abort the timed out command. This changeset also includes deferring the controller start portion of the reset to a separate task. This ensures we are always performing a controller start operation from a consistent context. Sponsored by: Intel Reviewed by: carl
# b846efd7	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add controller reset capability to nvme(4) and ability to explicitly invoke it from nvmecontrol(8). Controller reset will be performed in cases where I/O are repeatedly timing out, the controller reports an unrecoverable condition, or when explicitly requested via IOCTL or an nvme consumer. Since the controller may be in such a state where it cannot even process queue deletion requests, we will perform a controller reset without trying to clean up anything on the controller first. Sponsored by: Intel Reviewed by: carl
# 65c2474e	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Keep a doubly-linked list of outstanding trackers. This enables in-order re-submission of I/O after a controller reset. Sponsored by: Intel
# 0a0b08cc	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Enable asynchronous event requests on non-Chatham devices. Also add logic to clean up all outstanding asynchronous event requests when resetting or shutting down the controller, since these requests will not be explicitly completed by the controller itself. Sponsored by: Intel
# 274b3a88	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Specify command timeout interval on a per-command type basis. This is primarily driven by the need to disable timeouts for asynchronous event requests, which by nature should not be timed out. Sponsored by: Intel
# 879de699	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Explicitly abort a timed out command, if the ABORT command sent to the controller indicates the command was not found. Sponsored by: Intel
# 6cb06070	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Break out the code for completing an nvme_tracker object into a separate function. This allows for completions outside the normal completion path, for example when an ABORT command fails due to the controller reporting the targeted command does not exist. This is mainly for protection against a faulty controller, but we need to clean up our internal request nonetheless. Sponsored by: Intel
# 448195e7	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add support for ABORT commands, including issuing these commands when an I/O times out. Also ensure that we retry commands that are aborted due to a timeout. Sponsored by: Intel
# d6f54866	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add an internal _nvme_qpair_submit_request function, which performs the submit action assuming the qpair lock has already been acquired. Also change nvme_qpair_submit_request to just lock/unlock the mutex around a call to this new function. This fixes a recursive mutex acquisition in the retry path. Sponsored by: Intel
# 633c5729	31-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Use callout_reset_curcpu to allow the callout to be handled by the current CPU and not always CPU 0. This has the added benefit of reducing a huge amount of spinlock contention on the callout_cpu spinlock for CPU 0. Sponsored by: Intel
# 9427a0fe	18-Oct-2012	Gleb Smirnoff <glebius@FreeBSD.org>	Fix build after r241659.
# 0f71ecf7	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Add ability to queue nvme_request objects if no nvme_trackers are available. This eliminates the need to manage queue depth at the nvd(4) level for Chatham prototype board workarounds, and also adds the ability to accept a number of requests on a single qpair that is much larger than the number of trackers allocated. Sponsored by: Intel
# 21b6da58	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Preallocate a limited number of nvme_tracker objects per qpair, rather than dynamically creating them at runtime. Sponsored by: Intel
# 5ae9ed68	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Create nvme_qpair_submit_request() which eliminates all of the code duplication between the admin and io controller-level submit functions. Sponsored by: Intel
# c2e83b40	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Simplify how the qpair lock is acquired and released. Sponsored by: Intel
# 5fa5cc5f	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Cleanup uio-related code to use struct nvme_request and nvme_ctrlr_submit_io_request(). While here, also fix case where a uio may have more than 1 iovec. NVMe's definition of SGEs (called PRPs) only allows for the first SGE to start on a non-page boundary. The simplest way to handle this is to construct a temporary uio for each iovec, and submit an NVMe request for each. Sponsored by: Intel
# d281e8fb	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Add nvme_ctrlr_submit_[admin\|io]_request functions which consolidates code for allocating nvme_tracker objects and making calls into bus_dmamap_load for commands which have payloads. Sponsored by: Intel
# ad697276	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Add struct nvme_request object which contains all of the parameters passed from an NVMe consumer. This allows us to mostly build NVMe command buffers without holding the qpair lock, and also allows for future queueing of nvme_request objects in cases where the submission queue is full and no nvme_tracker objects are available. Sponsored by: Intel
# f2b19f67	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Merge struct nvme_prp_list into struct nvme_tracker. This simplifies the driver significantly where it is constructing commands to be submitted to hardware. By reducing the number of PRPs (NVMe parlance for SGE) from 128 to 32, it ensures we do not allocate too much memory for more common smaller I/O sizes, while still supporting up to 128KB I/O sizes. This also paves the way for pre-allocation of nvme_tracker objects for each queue which will simplify the I/O path even further. Sponsored by: Intel
# 9eb93f29	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Add return codes to all functions used for submitting commands to I/O queues. Sponsored by: Intel
# 6568ebfc	10-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Count number of times each queue pair's interrupt handler is invoked. Also add sysctls to query and reset each queue pair's stats, including the new count added here. Sponsored by: Intel
# bb0ec6b3	17-Sep-2012	Jim Harris <jimharris@FreeBSD.org>	This is the first of several commits which will add NVM Express (NVMe) support to FreeBSD. A full description of the overall functionality being added is below. nvmexpress.org defines NVM Express as "an optimized register interface, command set and feature set for PCI Express (PCIe)-based Solid-State Drives (SSDs)." This commit adds nvme(4) and nvd(4) driver source code and Makefiles to the tree. Full NVMe functionality description: Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe) device support. There will continue to be ongoing work on NVM Express support, but there is more than enough to allow for evaluation of pre-production NVM Express devices as well as soliciting feedback. Questions and feedback are welcome. nvme(4) implements NVMe hardware abstraction and is a provider of NVMe namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN. nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks. nvmecontrol(8) is used for NVMe configuration and management. 
The following are currently supported: nvme(4) - full mandatory NVM command set support - per-CPU IO queues (enabled by default but configurable) - per-queue sysctls for statistics and full command/completion queue dumps for debugging - registration API for NVMe namespace consumers - I/O error handling (except for timeouts; see below) - compilation switches for support back to stable-7 nvd(4) - BIO_DELETE and BIO_FLUSH (if supported by controller) - proper BIO_ORDERED handling nvmecontrol(8) - devlist: list NVMe controllers and their namespaces - identify: display controller or namespace identify data in human-readable or hex format - perftest: quick and dirty performance test to measure raw performance of NVMe device without userspace/physio/GEOM overhead The following are still work in progress and will be completed over the next 3-6 months in rough priority order: - complete man pages - firmware download and activation - asynchronous error requests - command timeout error handling - controller resets - nvmecontrol(8) log page retrieval This has been primarily tested on amd64, with light testing on i386. I would be happy to provide assistance to anyone interested in porting this to other architectures, but am not currently planning to do this work myself. Big-endian and dmamap sync for command/completion queues are the main areas that would need to be addressed. The nvme(4) driver currently has references to Chatham, which is an Intel-developed prototype board which is not fully spec compliant. These references will all be removed over time. Sponsored by: Intel Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>
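
Illustrative note on revision f93b7f95 (doorbell strides != 0): the log entry says the per-submission offset math was replaced by a load of a pre-computed value. The sketch below shows one way that idea could look, using the doorbell layout from the NVMe specification (doorbells begin at register offset 0x1000 and are spaced 4 << CAP.DSTRD bytes apart). The struct and function names are hypothetical and are not taken from nvme_qpair.c; this is a minimal sketch of the technique, not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell registers start at offset 0x1000 in controller register space. */
#define DOORBELL_BASE	0x1000u

/* Hypothetical per-queue-pair cache; names are illustrative only. */
struct qpair_doorbells {
	uint32_t sq_tdbl_off;	/* submission queue tail doorbell offset */
	uint32_t cq_hdbl_off;	/* completion queue head doorbell offset */
};

/*
 * Compute both offsets once, at queue-pair setup time, so the I/O fast
 * path only has to load a cached value. dstrd is the CAP.DSTRD field.
 */
static void
qpair_doorbells_init(struct qpair_doorbells *db, uint32_t qid, uint32_t dstrd)
{
	uint32_t stride;

	assert(dstrd <= 15);	/* CAP.DSTRD is a 4-bit field */
	stride = 4u << dstrd;	/* bytes between consecutive doorbells */
	db->sq_tdbl_off = DOORBELL_BASE + (2 * qid) * stride;
	db->cq_hdbl_off = DOORBELL_BASE + (2 * qid + 1) * stride;
}
```

With a stride value of 0 the doorbells for queue 1 sit 4 bytes apart; with the stride of 1 mentioned in the log entry they sit 8 bytes apart, which matches the "8 byte separation" described there.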