History log of /freebsd-current/sys/dev/nvme/nvme_ctrlr.c
Revision	Date	Author	Comments
# da4230af	13-May-2024	John Baldwin <jhb@FreeBSD.org>	nvme/f: Use strlcpy instead of strncpy + manual string termination Reviewed by: dab, imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D45153
# 97b77de2	16-Apr-2024	Warner Losh <imp@FreeBSD.org>	nvme: Eliminate intel_log_temp_stats_swapbytes We can't post an AER for this page, so there's no need to be able to swap it to host byte order. It's not one of the standard defined pages that can post via AER, and the vendor's public docs for this temperature page don't suggest it's possible to get over or under event changes. Since nvmecontrol no longer needs the swap routine, remove it since it's now unused. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D44659
# b354bb04	22-Mar-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Add constants for fields in AER completion dword 0 Reviewed by: imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D44445
# 2a2682ee	06-Mar-2024	Warner Losh <imp@FreeBSD.org>	nvme: Add SMART WARNING for persistent memory region NVME 2.0 added persistent memory regions, and this bit reports critical warnings / errors with those regions. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D44213
# 5cdedf67	06-Mar-2024	Warner Losh <imp@FreeBSD.org>	nvme: Log reset success or failure to devd We're logging when we start a reset, but not when we complete it, nor the result. Now log a success or timed_out event for the reset. Currently, the only detectable error we have from reset is 'failure to become ready in time,' though the code looks like it might be more generic. Log this and if we ever have other failure modes, change the logging to devd when that happens. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D44211
# 4f817fcf	06-Mar-2024	Warner Losh <imp@FreeBSD.org>	nvme: Change devctl events for the controller Change the devctl events slightly for the controller. SMART errors will log the changed bits in the NVME SMART Critical Warning State as its event. Reset will now emit 'event=start'. Soon more. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D44210
# fc3afe93	06-Mar-2024	Warner Losh <imp@FreeBSD.org>	nvme: split devctl out to its own function Split the devctl aspect of things out to its own function in nvme_ctrlr_devctl_log. In preparing to document this, and based on actual use, we want something different for the SMART errors, so this will facilitate that. Sponsored by: Netflix Reviewed by: chuck, mav Differential Revision: https://reviews.freebsd.org/D44209
# c5246cb7	01-Mar-2024	Warner Losh <imp@FreeBSD.org>	nvme: Report only the unknown bits When we get a smart error that's unknown, report only the unknown (reserved) bits of the Critical Warning Bitfield. Sponsored by: Netflix
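The masking described in the entry above can be sketched as follows. This is a minimal illustration, not the driver's code: the mask name `SKETCH_CRIT_WARN_KNOWN_MASK`, its value (bits 0-4 treated as known), and the helper name are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Assumption for this sketch: bits 0-4 of the critical warning field are
 * the "known" bits. The macro name and value are illustrative only.
 */
#define SKETCH_CRIT_WARN_KNOWN_MASK	0x1fu

/* Return only the bits that fall outside the known mask. */
static uint8_t
sketch_unknown_warning_bits(uint8_t critical_warning)
{
	return (critical_warning & (uint8_t)~SKETCH_CRIT_WARN_KNOWN_MASK);
}
```

With this mask, a field value of 0xa3 would be reported as 0xa0, since the low bits are already understood and only the reserved ones are of interest.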
# 7485926e	01-Mar-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Firmware revisions in the firmware slot info logpage are ASCII strings In particular, don't try to byteswap the values as 64-bit integers and always print a non-empty version as a string. Reviewed by: chuck, imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D44121
# 5650bd3f	29-Jan-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Use the NVMEF macro to construct fields Reviewed by: chuck, imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43605
# 8488fc41	29-Jan-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Use the NVMEM macro instead of expanded versions A few of these omitted a shift of 0, but this is more consistent. Reviewed by: chuck Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43602
# 479680f2	29-Jan-2024	John Baldwin <jhb@FreeBSD.org>	nvme: Use the NVMEV macro instead of expanded versions Reviewed by: chuck Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D43595
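The three entries above replace hand-expanded shift-and-mask expressions with helper macros. The log does not show the macro definitions, so the following is only a sketch of one common way such helpers are written with token pasting. Every name here (`SKETCH_FIELD_GET`, `SKETCH_FIELD_MASK`, `SKETCH_FIELD_PUT`, and the 4-bit field at bit 7) is hypothetical and may differ from what the driver headers define.

```c
#include <stdint.h>

/* Hypothetical register field: 4 bits wide, starting at bit 7. */
#define SKETCH_MPS_SHIFT	7
#define SKETCH_MPS_MASK		0xfu

/* Extract a field value from a register word. */
#define SKETCH_FIELD_GET(name, reg) \
	(((reg) >> name##_SHIFT) & name##_MASK)
/* Mask of the field in its in-register position (shift always applied). */
#define SKETCH_FIELD_MASK(name) \
	(name##_MASK << name##_SHIFT)
/* Place a value into the field's in-register position. */
#define SKETCH_FIELD_PUT(name, val) \
	(((val) & name##_MASK) << name##_SHIFT)

/* Replace the field in cc with mps, leaving all other bits untouched. */
static uint32_t
sketch_set_mps(uint32_t cc, uint32_t mps)
{
	cc &= ~SKETCH_FIELD_MASK(SKETCH_MPS);
	cc |= SKETCH_FIELD_PUT(SKETCH_MPS, mps);
	return (cc);
}
```

Always applying the shift, even when it is zero, keeps every call site uniform, which matches the consistency argument made in the NVMEM entry.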
# b46c7b1e	27-Dec-2023	Alexander Motin <mav@FreeBSD.org>	nvme: Add some bits from NVMe 2.0c spec. MFC after: 1 week
# d9b7301b	18-Dec-2023	Mark Johnston <markj@FreeBSD.org>	nvme: Initialize HMB entries before loading them into the controller struct nvme_hmb_desc contains a pad field which was not getting initialized before being synced. This doesn't have much consequence but triggers a report from KMSAN, which verifies that host-filled DMA memory is initialized before it is made visible to the device. So, let's just initialize it properly. Reported by: KMSAN Reviewed by: mav, imp MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D43090
# 34a6ad84	17-Nov-2023	Warner Losh <imp@FreeBSD.org>	nvme: Don't use version to listen for events for ns and fw changes Instead, use the attribute bits from the identification data to determine if we should listen to namespace changes and firmware activation. Should have no functional change, though we may stop listening for events that will never happen. Sponsored by: Netflix
# fd9a4a67	06-Nov-2023	Warner Losh <imp@FreeBSD.org>	cam: Minor opt_cam.h cleanup sys/cam/cam.h includes opt_cam.h, so none of the clients need to do this. cam.h does all the right dancing to conditionally include opt_cam.h only when it makes sense. It generally only matters when cam_debug.h is included (it must be included before that). Many of the stray opt_cam.h includes were after cam_debug.h which would be a problem were it not included in cam/cam.h. The other users of CAM options that aren't debug all already include cam/cam.h. Also trim unneeded sys/cdefs.h files from the files touched. Sponsored by: Netflix
# 8d6c0743	06-Nov-2023	Alexander Motin <mav@FreeBSD.org>	nvme: Introduce longer timeouts for admin queue KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty namespace. In some cases like hot-plug it takes longer, triggering timeout and controller resets after just 30 seconds. Linux for many years has separate 60 seconds timeout for admin queue. This patch does the same. And it is good to be consistent. Sponsored by: iXsystems, Inc. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D42454
# 6b2a6e9c	10-Oct-2023	Warner Losh <imp@FreeBSD.org>	nvme: Remove stale comment After da8324a9258f, the pre/post hooks are gone. So remove a comment about why we don't call them in this case. Sponsored by: Netflix Reviewed by: chuck, jhb Differential Revision: https://reviews.freebsd.org/D42050
# 40261289	10-Oct-2023	Warner Losh <imp@FreeBSD.org>	nvme: Really remove NVME_2X_RESET da8324a9258f removed one of the two instances of NVME_2X_RESET. It failed to snag the other one, and remove it from the options file. Remove from both of those here. Sponsored by: Netflix Reviewed by: chuck, gallatin, jhb Differential Revision: https://reviews.freebsd.org/D42049
# bc85cd30	10-Oct-2023	Warner Losh <imp@FreeBSD.org>	nvme: gc nvme_ctrlr_post_failed_request and related task stuff In 4b977e6dda92 we removed the call to nvme_ctrlr_post_failed_request because we can now directly fail requests in this context since we're in the reset task already. No need to queue it. I left it in place against future need, but it's been two years and no panics have resulted. Since the static analysis (code checking) and the dynamic analysis (surviving in the field for 2 years, including at $WORK where we know we've gone through this path when we've failed drives) both signal that it's not really needed, go ahead and GC it. If we discover at a later date a flaw in this analysis, we can add it back easily enough by reverting this and 4b977e6dda92. Sponsored by: Netflix Reviewed by: chuck, gallatin, jhb Differential Revision: https://reviews.freebsd.org/D42048
# 7ea866eb	07-Sep-2023	David Sloan <david.sloan@eideticom.com>	nvme: Fix memory leak in pt ioctl commands When running nvme passthrough commands through the ioctl interface memory is mapped with vmapbuf() but not unmapped. This results in leaked memory whenever a process executes an nvme passthrough command with a data buffer. This can be replicated with a simple C function (error checks skipped for brevity): void leak_memory(int nvme_ns_fd, uint16_t nblocks) { struct nvme_pt_command pt = { .cmd = { .opc = NVME_OPC_READ, .cdw12 = nblocks - 1, }, .len = nblocks * 512, // Assumes devices with 512 byte lba .is_read = 1, // Reads and writes should both trigger leak } void *buf; posix_memalign(&buf, nblocks * 512); pt.buf = buf; ioctl(nvme_ns_fd, NVME_PASSTHROUGH_COMMAND, &pt); free(buf); } Signed-off-by: David Sloan <david.sloan@eideticom.com> PR: 273626 Reviewed by: imp, markj MFC after: 1 week
# da8324a9	24-Sep-2023	Warner Losh <imp@FreeBSD.org>	nvme: Fix locking protocol violation to fix suspend / resume Currently, when we suspend, we need to tear down all the qpairs. We call nvme_admin_qpair_abort_aers with the admin qpair lock held, but the tracker it will call for the pending AER also locks it (recursively) hitting an assert. This routine is called without the qpair lock held when we destroy the device entirely in a number of places. Add an assert to this effect and drop the qpair lock before calling it. nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the list, dropping it around calls to nvme_qpair_complete_tracker, and restarting the list scan after picking it back up. Note: If interrupts are still running, there's a tiny window for these AERs: If one fires just an instant after we manually complete it, then we'll be fine: we set the state of the queue to 'waiting' and we ignore interrupts while 'waiting'. We know we'll destroy all the queue state with these pending interrupts before looking at them again and we know all the TRs will have been completed or rescheduled. So either way we're covered. Also, tidy up the failure case as well: failing a queue is a superset of disabling it, so no need to call disable first. This solves some locking issues with recursion since we don't need to recurse. Set the qpair state of failed queues to RECOVERY_FAILED and stop scheduling the watchdog. Assert we're not failed when we're enabling a qpair, since failure currently is one-way. Make failure a little less verbose. Next, kill the pre/post reset stuff. It's completely bogus since we disable the qpairs, we don't need to also hold the lock through the reset: disabling will cause the ISR to return early. This keeps us from recursing on the recovery lock when resuming. We only need the recovery lock to avoid a specific race between the timer and the ISR. Finally, kill NVME_2X_RESET. It's been a major release since we put it in and nobody has used it as far as I can tell. And it was a motivator for the pre/post uglification. These are all interrelated, so need to be done at the same time. Sponsored by: Netflix Reviewed by: jhb Tested by: jhb (made sure suspend / resume worked) MFC After: 3 days Differential Revision: https://reviews.freebsd.org/D41866
# 8052b01e	25-Aug-2023	Warner Losh <imp@FreeBSD.org>	nvme: Add exclusion for ISR Add a basically uncontended spinlock that we take out while the ISR is running. This has two effects: First, when we get a timeout, we can safely call the nvme_qpair_process_completions w/o racing any ISRs. Second, we can use it to ensure that we don't reset the card while the ISRs are active (right now we just sleep and hope for the best, which usually is fine, but not always). Sponsored by: Netflix MFC After: 2 weeks Reviewed by: chuck, gallatin Differential Revision: https://reviews.freebsd.org/D41452
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 4d846d26	10-May-2023	Warner Losh <imp@FreeBSD.org>	spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
# 6e8ab671	04-Jun-2022	Gordon Bergling <gbe@FreeBSD.org>	nvme(4): Fix a typo in a source code comment - s/inaccessable/inaccessible/ MFC after: 3 days
# 3740a8db	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Further refinements in Host Memory Buffer Sizing Host Memory Buffer units are a mix. For those in the identify structure, the size is in 4kiB chunks. For specifying the buffer description, though, they are in terms of the drive's MPS. Add comments to this effect and change PAGE_SIZE to ctrlr->page_size where needed, as well as correct a mistaken use of NVME_HPS_UNITS in 214df80a9cb3 as pointed out by rpokala@ after the commit. No functional change is intended, as page_size is still 4k which matches all current hosts' PAGE_SIZE, but to support 16k pages on arm, we need to differentiate these two cases. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D34871
# 3086efe8	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Remove NVME_MAX_XFER_SIZE, replace inline calculation NVME_MAX_XFER_SIZE used to be a constant (back when MAXPHYS was a constant) to denote the smaller of MAXPHYS or the largest PRP we could encode with our preallocation scheme. However, it's no longer constant since MAXPHYS varies at runtime. In addition, the actual maximum is now based on the drive's currently in use page_size, which is also a runtime expression. As such, remove the define and expand it inline in the one place it's used still in the tree. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34870
# 3a468f20	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Use saved mps when initializing drive Make sure we set the MPS we cached (currently the drives minimum mps) in CC (Controller Configuration) when reinitializing the drive. It must match the page_size that we're going to use. Also retire less specific NVME_PAGE_SHIFT since it's now unused. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34869
# 55412ef9	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Rename min_page_size to page_size and save mps The Memory Page Size sets the basic unit of operation for the drive. We currently set this to the drive's minimum page size, but we could set it to any page size the drive supports in the future. Replace min_page_size (it's now unused for that purpose) with page_size to reflect this and cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34868
# 6e3deec8	15-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: Base maximum data transfer size directly on MPSMIN in cap_hi Calculate the maximum transfer size based on the MPSMIN we have in our cached copy of cap_hi rather than using min_page_size in the controller. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34867
# 214df80a	08-Apr-2022	Warner Losh <imp@FreeBSD.org>	nvme: new define for size of host memory buffer sizes The nvme spec defines the various fields that specify sizes for host memory buffers in terms of 4096 chunks. So, rather than use a bare 4096 here, use NVME_HMB_UNITS. This is explicitly not the host page size of 4096, nor the default memory page size (mps) of the NVMe drive, but its own thing and needs its own define. No functional change is intended, only the logical spelling of 4k. Sponsored by: Netflix
# 6af6a52e	29-Mar-2022	Warner Losh <imp@FreeBSD.org>	nvme: Save cap_lo and cap_hi Save the capabilities for the drive. Sponsored by: Netflix
# a70b5660	29-Mar-2022	Warner Losh <imp@FreeBSD.org>	nvme: MPS is a power of two, not a size / 8k Setting MPS in the CC should be a power of 2 number (it specifies the page size of the host is 2^(12+MPS)), so adjust the calculation. There is no functional change because we do not support any architectures != 4k pages (yet). Other changes are needed for architectures with 16k or 64k pages, especially when the underlying NVMe drive doesn't support that page size (Most drives support a range that's small, and many only support 4k), but let's at least do this calculation correctly. 12 - 12 is just as much 0 as 4096 >> 13 is :) Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D34707
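The relation quoted in the entry above, page size = 2^(12+MPS), can be inverted with a small helper. This is a sketch under the assumption that the caller passes a power-of-two page size of at least 4096 bytes; the function name is hypothetical and this is not the driver's code.

```c
#include <stdint.h>

/*
 * Compute the MPS value for a page size using page_size = 2^(12 + MPS).
 * Returns -1 if page_size is below 4096 or not a power of two.
 */
static int
sketch_page_size_to_mps(uint32_t page_size)
{
	int shift = 0;

	if (page_size < 4096 || (page_size & (page_size - 1)) != 0)
		return (-1);
	/* Find log2(page_size) by counting shifts down to 1. */
	while ((page_size >> shift) != 1)
		shift++;
	return (shift - 12);
}
```

For 4 KiB pages both the old expression (4096 >> 13) and the log2-based one give 0, which is why the entry reports no functional change. They diverge for larger pages: a 16 KiB page needs MPS 2, while 16384 >> 13 is also 2 only by coincidence, and a 64 KiB page needs MPS 4 whereas 65536 >> 13 yields 8.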
# 83581511	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Use adaptive spinning when polling for completion or state change We only use nvme_completion_poll in the initialization path. The commands they queue and wait for finish quickly as they involve no I/O to the drive's media. These commands take about 20-200 microseconds each. Set the wait time to 1us and then increase it by a factor of 1.5 each successive iteration (max 1ms). This reduces initialization time by 80ms in cperciva's tests. Use this same technique waiting for RDY state transitions. This saves another 20ms. In total we're down from ~330ms to ~2ms. Tested by: cperciva Sponsored by: Netflix Reviewed by: mav Differential Review: https://reviews.freebsd.org/D32259
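The polling schedule in the entry above (start at 1us, grow by 1.5 per iteration, cap at 1ms) might look like the sketch below. It assumes the growth is multiplicative and uses integer rounding so that the 1us step actually advances. The callback-based `sketch_poll` and all names are hypothetical stand-ins for whatever the driver uses to test completion and to delay.

```c
#include <stdint.h>

#define SKETCH_DELAY_MIN_US	1u
#define SKETCH_DELAY_MAX_US	1000u

/* Next delay: roughly 1.5x the current one (rounded up), capped at 1 ms. */
static uint32_t
sketch_next_delay_us(uint32_t delay_us)
{
	uint32_t next = delay_us + (delay_us + 1) / 2;

	return (next > SKETCH_DELAY_MAX_US ? SKETCH_DELAY_MAX_US : next);
}

/*
 * Poll done(arg) with growing delays. wait_us is the caller's delay
 * primitive. Returns 0 on completion, -1 once budget_us has been spent.
 */
static int
sketch_poll(int (*done)(void *), void *arg, void (*wait_us)(uint32_t),
    uint64_t budget_us)
{
	uint32_t delay_us = SKETCH_DELAY_MIN_US;
	uint64_t waited_us = 0;

	while (!done(arg)) {
		if (waited_us >= budget_us)
			return (-1);
		wait_us(delay_us);
		waited_us += delay_us;
		delay_us = sketch_next_delay_us(delay_us);
	}
	return (0);
}
```

Short delays at the start suit commands that finish in tens of microseconds, while the cap keeps a slow command from being checked less often than once per millisecond.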
# 4b3da659	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Only reset once on attach. The FreeBSD nvme driver has reset the nvme controller twice on attach to address a theoretical issue assuring the hardware is in a known state. However, experience has shown the second reset is unnecessary and increases the time to boot. Eliminate the second reset. Should there be a situation when you need a second reset (for buggy or at least somewhat out of the mainstream hardware), the hardware option NVME_2X_RESET will restore the old behavior. Document this in nvme(4). If there's any trouble at all with this, I'll add a sysctl tunable to control it. Sponsored by: Netflix Reviewed by: cperciva, mav Differential Revision: https://reviews.freebsd.org/D32241
# e5e26e4a	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Remove pause while resetting After some study of the code and the standard, I think we can just drop the pause(), unconditionally. If we're not initialized, then there's nothing to wait for from a software perspective. If we are initialized, then there might be outstanding I/O. If so, then the qpair 'recovery state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which will ignore any interrupts for items that complete before we complete the reset by setting cc.en=0. If we go on to fail the controller, we'll cancel the outstanding I/O transactions. If we reset the controller, the hardware throws away pending transactions and we retry all the pending I/O transactions. Any transactions that happened to complete before cc.en=0 will have the same effect in the end (doing the same transaction twice is just inefficient, it won't affect the state of the device any differently than having done it once). The standard imposes no wait times here, so it isn't needed from that perspective. Unanswered Question: Do we need to disable interrupts while we disable in legacy mode, since those are level-sensitive? Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D32248
# 77054a89	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Explain a workaround a little better The "don't touch the mmio of the drive after we do an EN 1->0 transition" workaround is only for a tiny number of drives that have this unfortunate issue. Sponsored by: Netflix
# a245627a	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme_ctrlr_enable: Small style nits Rewrite the nested if's using the preferred FreeBSD style for branches of ifs that return. NFC. Minor tweaks to the comments to better fit new code layout. Sponsored by: Netflix Reviewed by: mav, chuck (prior rev, but comments rolled in) Differential Revision: https://reviews.freebsd.org/D32245
# 26259f6a	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme: Use MS_2_TICKS rather than rolling our own Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D32246
# d5fca1dc	01-Oct-2021	Warner Losh <imp@FreeBSD.org>	nvme_ctrlr_enable: Remove unnecessary 5ms delays Remove the 5ms delays after writing the administrative queue registers. These delays are from the very earliest days of the driver (they are in the first commit) and were most likely vestiges of the Chatham NVMe prototype card that was used to create this driver. Many of the workarounds necessary for it aren't necessary for standards compliant cards. The original driver had other areas marked for Chatham, but these were not. They are unneeded. There's three lines of supporting evidence. First, the NVMe standards make no mention of a delay time after these registers are written. Second, the Linux driver doesn't have them, even as an option. Third, all my nvme cards work w/o them. To be safe, add a write barrier between setting up the admin queue and enabling the controller. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D32247
# 502dc84a	23-Sep-2021	Warner Losh <imp@FreeBSD.org>	nvme: Use shared timeout rather than timeout per transaction Keep track of the approximate time commands are 'due' and the next deadline for a command. Twice a second, wake up to see if any commands have entered timeout. If so, quiesce and then enter a recovery mode half the timeout further in the future to allow the ISR to complete. Once we exit recovery mode, we go back to operations as normal. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28583
# bad42df9	05-Sep-2021	Colin Percival <cperciva@FreeBSD.org>	Add some nvme initialization routines to TSLOG About 335 ms of EC2 instance boot time is being spent here.
# e3bdf3da	31-Aug-2021	Alexander Motin <mav@FreeBSD.org>	nvme(4): Add MSI and single MSI-X support. If we can't allocate more MSI-X vectors, accept using single shared. If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but accept single shared. If still no luck, fall back to shared INTx. This provides maximal flexibility in some limited scenarios. For example, vmd(4) does not support INTx and can handle only limited number of MSI/MSI-X vectors without sharing. MFC after: 1 week
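The fallback order in the entry above can be written as a pure decision function. This is a simplified sketch: it assumes the caller already knows how many MSI-X and MSI vectors could be allocated, whereas a real driver discovers this by attempting the allocations. The enum and function names are hypothetical.

```c
/* Interrupt modes in order of preference (names are illustrative). */
enum sketch_intr_mode {
	SKETCH_MSIX_MULTI,	/* one MSI-X vector per queue */
	SKETCH_MSIX_SHARED,	/* single shared MSI-X vector */
	SKETCH_MSI_PAIR,	/* two MSI vectors */
	SKETCH_MSI_SHARED,	/* single shared MSI vector */
	SKETCH_INTX_SHARED	/* legacy shared INTx */
};

/*
 * Pick a mode given how many vectors are wanted and how many of each
 * kind are obtainable, following the order described in the log entry.
 */
static enum sketch_intr_mode
sketch_pick_intr(int wanted, int msix_avail, int msi_avail)
{
	if (wanted > 1 && msix_avail >= wanted)
		return (SKETCH_MSIX_MULTI);
	if (msix_avail >= 1)
		return (SKETCH_MSIX_SHARED);
	if (msi_avail >= 2)
		return (SKETCH_MSI_PAIR);
	if (msi_avail >= 1)
		return (SKETCH_MSI_SHARED);
	return (SKETCH_INTX_SHARED);
}
```

Under this ordering a parent bus that offers no INTx and only a few message vectors, as the entry says of vmd(4), still ends up with a usable shared vector.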
# 31111372	30-Aug-2021	Alexander Motin <mav@FreeBSD.org>	nvme(4): Do not panic on admin queue construct error. MFC after: 1 week
# f0f47121	28-May-2021	Warner Losh <imp@FreeBSD.org>	nvme: fix a race between failing the controller and failing requests Part of the nvme recovery process for errors is to reset the card. Sometimes, this results in failing the entire controller. When nda is in use, we free the sim, which will sleep until all the I/O has completed. However, with only one thread, the request fail task never runs once the reset thread sleeps here. Create two threads to allow I/O to fail until it's all processed and the reset task can proceed. This is a temporary kludge until I can work out questions that arose during the review, not least is what was the race that queueing to a failure task solved. The original commit is vague and other error paths in the same context do a direct failure. I'll investigate that more completely before committing changing that to a direct failure. mav@ raised this issue during the review, but didn't otherwise object. Multiple threads, though, solve the problem in the mean time until other such means can be perfected. Reviewed by: jhb@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30366
# 4fbbe523	17-Mar-2021	Alexander Motin <mav@FreeBSD.org>	nvme: Replace potentially long DELAY() with pause(). In some cases like broken hardware nvme(4) may wait minutes for controller response before timeout. Doing so in a tight spin loop made whole system unresponsive. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29309 Sponsored by: iXsystems, Inc.
# 8423f5d4	11-Mar-2021	Warner Losh <imp@FreeBSD.org>	nvme: use config_intrhook_drain to avoid removable card races nvme drives are configured early in boot. However, a number of the configuration steps take a while, so we defer those to a config intrhook that runs before the root filesystem is mounted. At the same time, the PCI hot plug wakes up and tests the status of the card. It may decide that the card has gone away and deletes the child. As part of that process nvme_detach is called. If this call happens after the config_intrhook starts to run, but before it is finished, there's a race where we can tear down the device's soft state while the config_intrhook is still using it. Use the new config_intrhook_drain to disestablish the hook. Either it will be removed w/o running, or the routine will wait for it to finish. This closes the race and allows safe hotplug at any time, even very early in boot. Sponsored by: Netflix, Inc Reviewed by: jhb, mav Differential Revision: https://reviews.freebsd.org/D29006
# dd2516fc	08-Feb-2021	Warner Losh <imp@FreeBSD.org>	nvme: Make nvme_ctrlr_hw_reset static nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so make it static. If we need to change this in the future we can.
# 9600aa31	08-Feb-2021	Warner Losh <imp@FreeBSD.org>	nvme: use NVME_GONE rather than hard-coded 0xffffffff Make it clearer that the value 0xffffffff is being used to detect the device is gone. We use it other places in the driver for other meanings.
# 1770bae5	28-Nov-2020	Alexander Motin <mav@FreeBSD.org>	Remove alignment requirements for passthrough buffer. After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers thanks to extra page added to pbuf_zone.
# ac90f70d	28-Nov-2020	Alexander Motin <mav@FreeBSD.org>	Increase nvme(4) maximum transfer size from 1MB to 2MB. With 4KB page size the 2MB is the maximum we can address with one page PRP. Going further would require chaining, that would add some more complexity. On the other side, to reduce memory consumption, allocate the PRP memory respecting maximum transfer size reported in the controller identify data. Many of NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Reviewed by: imp MFC after: 2 weeks Sponsored by: iXsystems, Inc.
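The 2MB figure in the entry above follows from arithmetic on a single page of PRP entries. The sketch below assumes 8-byte (64-bit) PRP entries and ignores alignment details of the first data page. The helper names are hypothetical, and `maxphys` is passed in as a plain parameter rather than read from the kernel.

```c
#include <stdint.h>

/* Bytes addressable through one page_size-byte page of 8-byte PRP entries. */
static uint64_t
sketch_prp_page_limit(uint32_t page_size)
{
	uint64_t entries = page_size / sizeof(uint64_t);

	return (entries * page_size);
}

/* Upper bound on a transfer: the smaller of maxphys and the PRP limit. */
static uint64_t
sketch_max_xfer(uint64_t maxphys, uint32_t page_size)
{
	uint64_t limit = sketch_prp_page_limit(page_size);

	return (maxphys < limit ? maxphys : limit);
}
```

With 4096-byte pages there are 512 entries, each mapping 4096 bytes, which gives 2097152 bytes. Going beyond that would need a chained PRP list, which the entry describes as added complexity.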
# cd853791	27-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allow several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initialize maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav (*) Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225
# 0bed3eab	13-Nov-2020	Alexander Motin <mav@FreeBSD.org>	Add PMRCAP printing and fix earlier CAP_HI. MFC after: 3 days
# 46fbd800	12-Nov-2020	Alexander Motin <mav@FreeBSD.org>	Fix panic if NVMe is detached before the intrhook call. MFC after: 1 week Sponsored by: iXsystems, Inc.
# c44441f8	28-Oct-2020	Alexander Motin <mav@FreeBSD.org>	Print NVMe controller capabilities in verbose dmesg. Those values are not reported in controller identification, while sometimes interesting for development and debugging. MFC after: 1 week
# 44ca4575	21-Oct-2020	Brooks Davis <brooks@FreeBSD.org>	vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD MFC after: 1 week Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784
# 915f0197	14-Oct-2020	Alexander Motin <mav@FreeBSD.org>	Use RTD3 Entry Latency value as shutdown timeout. This field was not in specs when the driver was written, but now there are SSDs with the reported latency of 10s, where hardcoded value of 5s seems to be not enough sometimes, causing shutdown timeout messages. MFC after: 1 week Sponsored by: iXsystems, Inc.
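A timeout derived from the reported latency, as in the entry above, could be computed as in the sketch below. It assumes the RTD3 Entry Latency field is expressed in microseconds and that a value of zero means the device does not report it, in which case the sketch falls back to the 5s value the entry mentions. The function name and the choice of milliseconds for the result are illustrative, not taken from the driver.

```c
#include <stdint.h>

#define SKETCH_DEFAULT_SHUTDOWN_MS	5000u

/*
 * Convert a reported latency in microseconds to a timeout in
 * milliseconds, rounding up. Zero is treated as "not reported".
 */
static uint32_t
sketch_shutdown_timeout_ms(uint32_t rtd3e_us)
{
	if (rtd3e_us == 0)
		return (SKETCH_DEFAULT_SHUTDOWN_MS);
	/* Widen before adding so values near UINT32_MAX cannot overflow. */
	return ((uint32_t)(((uint64_t)rtd3e_us + 999) / 1000));
}
```

A device reporting 10 seconds (10000000us) would then get a 10000ms timeout instead of the fixed 5000ms.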
# e32d47f3	21-Sep-2020	David Bright <dab@FreeBSD.org>	Add an ioctl to get an NVMe device's maximum transfer size Reviewed by: imp, chuck Obtained from: Dell EMC Isilon MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26390
# d87b31e1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	nvme: clean up empty lines in .c and .h files
# 881534f0	31-Aug-2020	Warner Losh <imp@FreeBSD.org>	Use symbolic names for async events Rather than |= 0x300, define and use async event names for the name space changes and the firmware activations that we're asking for.
# 701267ad	25-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Fix few panics on NVMe's timing out initialization requests. MFC after: 1 week Sponsored by: iXsystems, Inc.
# ead7e103	18-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Make polled request timeout less invasive. Instead of panic after one second of polling, make the normal timeout handler to activate, reset the controller and abort the outstanding requests. If all of it won't happen within 10 seconds then something in the driver is likely stuck bad and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. MFC after: 1 week Sponsored by: iXsystems, Inc.
# 550d5d64	17-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Fix admin qpair leak if detached during initial reset. MFC after: 1 week Sponsored by: iXsystems, Inc.
# 92390644	12-Jun-2020	Alexander Motin <mav@FreeBSD.org>	Fix config_intrhook leak on initial reset failure. MFC after: 1 week Sponsored by: iXsystems, Inc.
# 4053f8ac	02-May-2020	David Bright <dab@FreeBSD.org>	Fix various Coverity-detected errors in nvme driver This fixes several Coverity-detected errors in the nvme driver. CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975, 1403980 Reviewed by: imp@, vangyzen@ MFC after: 5 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24532
# 4e6a434b	29-Apr-2020	Warner Losh <imp@FreeBSD.org>	Make sure that we get the sbuf resources we need. Since we're calling sbuf_new with NOWAIT, make sure it can allocate a buffer to use. Don't print anything if we can't get it. Noticed by: rpokala
# 244b8053	29-Apr-2020	Warner Losh <imp@FreeBSD.org>	Generate a devctl event for interesting events When we reset the controller, and when the controller tells us about a critical warning, send an event.
# b2cdfb72	08-Jan-2020	Alexander Motin <mav@FreeBSD.org>	Fix copy-paste bug in HMB free code. MFC after: 2 weeks X-MFC-with: r356474
# 6de4e458	07-Jan-2020	Alexander Motin <mav@FreeBSD.org>	Minor adjustments to r356474 and r356480. Reported by: jkim, imp MFC after: 2 weeks X-MFC-with: r356474
# 1c7dd40e	07-Jan-2020	Alexander Motin <mav@FreeBSD.org>	Increase HMB limit from 1% to 5%. SSD capacity in laptops is growing faster than RAM size, so my original guess seems too low on second thought. Hopefully nobody will build large array of those crappy SSDs. MFC after: 2 weeks X-MFC-with: 356474
# 67abaee9	07-Jan-2020	Alexander Motin <mav@FreeBSD.org>	Add Host Memory Buffer support to nvme(4). This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about 1MB per 1GB on the devices I have) for its metadata cache, significantly improving random I/O performance. Device reports minimal and preferable size of the buffer. The code limits it to 1% of physical RAM by default. If the buffer can not be allocated or below minimal size, the device will just have to work without it. MFC after: 2 weeks Relnotes: yes Sponsored by: iXsystems, Inc.
# 7588c6cc	13-Dec-2019	Warner Losh <imp@FreeBSD.org>	Move to using bool instead of boolean_t While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999
# 66e59850	11-Dec-2019	Warner Losh <imp@FreeBSD.org>	Move reset to the interrupt processing stage This trims the boot time a bit more for AWS and other platforms that have nvme drives. There's no reason to do this inline. This has been in my tree a while, but IIRC I talked to Jim Harris about this at one of our face to face meetings. MFC After: 2 weeks
# 1eab19cb	23-Sep-2019	Alexander Motin <mav@FreeBSD.org>	Make nvme(4) driver some more NUMA aware. - For each queue pair precalculate CPU and domain it is bound to. If queue pairs are not per-CPU, then use the domain of the device. - Allocate most of queue pair memory from the domain it is bound to. - Bind callouts to the same CPUs as queue pair to avoid migrations. - Do not assign queue pairs to each SMT thread. It just wasted resources and increased lock congestions. - Remove fixed multiplier of CPUs per queue pair, spread them even. This allows to use more queue pairs in some hardware configurations. - If queue pair serves multiple CPUs, bind different NVMe devices to different CPUs. MFC after: 1 month Sponsored by: iXsystems, Inc.
# f93b7f95	04-Sep-2019	Warner Losh <imp@FreeBSD.org>	Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514
# 4d547561	03-Sep-2019	Warner Losh <imp@FreeBSD.org>	Implement nvme suspend / resume for pci attachment When we suspend, we need to properly shutdown the NVME controller. The controller may go into D3 state (or may have the power removed), and to properly flush the metadata to non-volatile RAM, we must complete a normal shutdown. This consists of deleting the I/O queues and setting the shutdown bit. We have to do some extra stuff to make sure we reset the software state of the queues as well. On resume, we have to reset the card twice, for reasons described in the attach function. Once we've done that, we can restart the card. If any of this fails, we'll fail the NVMe card, just like we do when a reset fails. Set is_resetting for the duration of the suspend / resume. This keeps the reset taskqueue from running a concurrent reset, and also is needed to prevent any hw completions from queueing more I/O to the card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get that from the global state of the ctrlr. Wait for any pending reset to finish. All queued I/O will get sent to the hardware as part of nvme_ctrlr_start(), though the upper layers shouldn't send any down. Disabling the qpairs is the other failsafe to ensure all I/O is queued. Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid confusion with all the other destroy functions. It just removes the queues in hardware, while the other _destroy_ functions tear down driver data structures. Split parts of the hardware reset function up so that I can do part of the reset in suspend. Split out the software disabling of the qpairs into nvme_ctrlr_disable_qpairs. Finally, fix a couple of spelling errors in comments related to this. Relnotes: Yes MFC After: 1 week Reviewed by: scottl@ (prior version) Differential Revision: https://reviews.freebsd.org/D21493
# ab0681aa	02-Sep-2019	Warner Losh <imp@FreeBSD.org>	In all the places that we use the polled for completion interface, except crash dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler chooses to not inline, any performance hit is tiny.
# 8e61280b	22-Aug-2019	Warner Losh <imp@FreeBSD.org>	When we have errors resetting the device before we allocate the queues, don't try to tear them down in the ctrlr_destroy path. Otherwise, we dereference queue structures that are NULL and we trap. This fix is incomplete: we leak IRQ and MSI resources when this happens. That's preferable to a crash but still should be fixed.
# f182f928	21-Aug-2019	Warner Losh <imp@FreeBSD.org>	Separate the pci attachment from the rest of nvme Nvme drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF. Submitted by: cognet@
# 71a28181	21-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Improve NVMe hot unplug handling. If device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it triggers a controller reset, do not wait for the impossible and quickly report failure. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# a6d222eb	02-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Add more random bits from NVMe 1.4. MFC after: 2 weeks
# 6c99d132	02-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Decode few more NVMe log pages. In particular: Changed Namespace List, Commands Supported and Effects, Reservation Notification, Sanitize Status. Add few new arguments to `nvmecontrol log` subcommand. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# a7bf63be	01-Aug-2019	Alexander Motin <mav@FreeBSD.org>	Add IOCTL to translate nvdX into nvmeY and NSID. While very useful by itself, it also makes `nvmecontrol` not depend on hardcoded device names parsing, that in its turn makes simple to take nvdX (and potentially any other) device names as arguments. Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them interchangeable for management purposes. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# 08a607e0	25-Jul-2019	Warner Losh <imp@FreeBSD.org>	Widen the type for to. The timeout field in the CAPS register is defined to be 8 bits, so its type was uint8_t. We recently started adding 1 to it to cope with rogue devices that listed 0 timeout time (which is impossible). However, in so doing, other devices that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed. Widen the type to be uint32_t like its source register to avoid the issue. Reported by: bapt@
# 62d2cf18	18-Jul-2019	Warner Losh <imp@FreeBSD.org>	Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers. These macros make places where we extract these easier to read. The shift and mask stuff is also a bit tedious and error prone. Start with the CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is style change only, no functional changes. Reviewed by: chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20979
# dc9df3a5	16-Jul-2019	Warner Losh <imp@FreeBSD.org>	Assume that the timeout value from the capacity is 1-based Neither the 1.3 nor 1.4 standards say this number is 1's based, but adding 1 costs little and copes with those NVMe drives that report '0' in this field cheaply. This is consistent with what the Linux driver does as well.
# 9835d216	08-May-2019	Warner Losh <imp@FreeBSD.org>	rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to match init/uninit scenarios. Signed-off-by: John Meneghini <johnm@netapp.com> Submitted by: Michael Hordijk <hordijk@netapp.com> Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D19781
# 2ffd6fce	08-Mar-2019	Warner Losh <imp@FreeBSD.org>	Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431
# 45d7e233	27-Feb-2019	Warner Losh <imp@FreeBSD.org>	Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.
# 756a5412	14-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt to modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho
# 91182bcf	07-Dec-2018	Warner Losh <imp@FreeBSD.org>	Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli (and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3. Sponsored by: Netflix
# 9544e6dc	21-Aug-2018	Chuck Tuffli <chuck@FreeBSD.org>	Make NVMe compatible with the original API The original NVMe API used bit-fields to represent fields in data structures defined by the specification (e.g. the op-code in the command data structure). The implementation targeted x86_64 processors and defined the bit fields for little endian dwords (i.e. 32 bits). This approach does not work as-is for big endian architectures and was changed to use a combination of bit shifts and masks to support PowerPC. Unfortunately, this changed the NVMe API and forces #ifdef's based on the OS revision level in user space code. This change reverts to something that looks like the original API, but it uses bytes instead of bit-fields inside the packed command structure. As a bonus, this works as-is for both big and little endian CPU architectures. Bump __FreeBSD_version to 1200081 due to API change Reviewed by: imp, kbowling, smh, mav Approved by: imp (mentor) Differential Revision: https://reviews.freebsd.org/D16404
# f439e3a4	24-May-2018	Alexander Motin <mav@FreeBSD.org>	Refactor NVMe CAM integration. - Remove layering violation, when NVMe SIM code accessed CAM internal device structures to set pointers on controller and namespace data. Instead make NVMe XPT probe fetch the data directly from hardware. - Cleanup NVMe SIM code, fixing support for multiple namespaces per controller (reporting them as LUNs) and adding controller detach support and run-time namespace change notifications. - Add initial support for namespace change async events. So far only in CAM mode, but it allows run-time namespace arrival and departure. - Add missing nvme_notify_fail_consumers() call on controller detach. Together with previous changes this allows NVMe device detach/unplug. Non-CAM mode still requires a lot of love to stay on par, but at least CAM mode code should not stay in the way so much, becoming much more self-sufficient. Reviewed by: imp MFC after: 1 month Sponsored by: iXsystems, Inc.
# c252f637	02-May-2018	Alexander Motin <mav@FreeBSD.org>	Fix LOR between controller and queue locks. Admin pass-through requests took controller lock before the queue lock, but in case of request submission to a failed controller controller lock was taken after the queue lock. Fix that by reducing the lock scopes and switching to mtx_pool locks to track pass-through request completion. Sponsored by: iXsystems, Inc.
# e134ecdc	30-Apr-2018	Alexander Motin <mav@FreeBSD.org>	Improve nvme(4) attach/detach sequences. This change allows clean device detach on attach failures and driver unload, while previous code tried to talk to already shut down controller, or even accessed resources failed to allocate. Sponsored by: iXsystems, Inc.
# 5d7fd8f7	14-Mar-2018	Warner Losh <imp@FreeBSD.org>	Fix error messages in cut and pasted code. Also, fix an unnecessary deref to get ctrlr. Noticed by: rpokala@ Sponsored by: Netflix
# 8b1e6ebe	14-Mar-2018	Warner Losh <imp@FreeBSD.org>	When tearing down a queue pair, also delete the queue entries. The NVME standard has required in section 7.2.6, since at least 1.1, that a clean shutdown is signalled by deleting the submission and the completion queues before setting the shutdown bit in CC. The 1.0 standard, apparently, did not and many of the early Intel cards didn't care. Some newer cards care, at least one whose beta firmware can scramble the card on an unclean shutdown. Linux has done this for some time. To make it possible to move forward with an evaluation of this pre-release card with wonky firmware, delete the queues on the card when we delete the qpair structures. Sponsored by: Netflix
# 0d787e9b	22-Feb-2018	Wojciech Macek <wma@FreeBSD.org>	NVMe: Add big-endian support Remove bitfields from defined structures as they are not portable. Instead use shift and mask macros in the driver and nvmecontrol application. NVMe is now working on powerpc64 host. Submitted by: Michal Stanek <mst@semihalf.com> Obtained from: Semihalf Reviewed by: imp, wma Sponsored by: IBM, QCM Technologies Differential revision: https://reviews.freebsd.org/D13916
# 29077eb4	28-Jan-2018	Warner Losh <imp@FreeBSD.org>	Use atomic load and stores to ensure that the compiler doesn't optimize away these loops. Change boolean to int to match what atomic API supplies. Remove wmb() since the atomic_store_rel() on status.done ensure the prior writes to status. It also fixes the fact that there wasn't a rmb() before reading done. This should also be more efficient since wmb() is fairly heavy weight. Sponsored by: Netflix Reviewed by: kib@, jim harris Differential Revision: https://reviews.freebsd.org/D14053
# 989c7f0b	18-Dec-2017	Warner Losh <imp@FreeBSD.org>	Although we only have one quirk at the moment, guard against the day we have more than one by checking the actual quirk bit before delaying the reset. Noticed by: rpokala@
# ce1ec9c1	18-Dec-2017	Warner Losh <imp@FreeBSD.org>	When we're disabling the nvme device, some drives have a controller bug that requires 'hands off' for a period of time (2.3s) before we check the RDY bit. Since this is a very odd quirk for a very limited selection of drives, do this as a quirk. This prevented a successful reset of the card when the card wedged. Also, make sure that we comply with the advice from section 3.1.5 of the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has undefined results". Short circuit when EN == RDY == desired state. Finally, fail the reset if the disable fails. This will lead to a failed device, which is what we want. (note: nda device needs work for coping with a failed device). Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D13389
# 718cf2cc	27-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/dev: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
# bb1c7be4	15-Oct-2017	Warner Losh <imp@FreeBSD.org>	Create general polling function for the nvme controller. Use it when we're doing the various pin-based interrupt modes. Adjust nvme_ctrlr_intx_handler to use nvme_ctrlr_poll. Sponsored by: Netflix Suggested by: scottl@
# 5fff95cc	20-Sep-2017	Warner Losh <imp@FreeBSD.org>	Fix queue depth for nda. 1/4 of the number of queues times queue entries is too limiting. It works up to about 4k IOPS / 3.0GB/s for hardware that can do 4.4k/3.2GB/s with nvd. 3/4 works better, though it highlights issues in the fairness of nda's choice of TRIM vs READ. That will be fixed separately.
# c02565f9	28-Aug-2017	Warner Losh <imp@FreeBSD.org>	Set the max transactions for NVMe drives better. Provided a better estimate for the number of transactions that can be pending at one time. This will be number of queues * number of trackers / 4, as suggested by Jim Harris. This gives a better estimate of the number of transactions that CAM should queue before applying back pressure. This should be revisited when we have real multi-queue support in CAM and the upper layers of the I/O stack. Sponsored by: Netflix
# 696c9502	25-Aug-2017	Warner Losh <imp@FreeBSD.org>	NVME Namespace ID is 32-bits, so widen interface to reflect that. Sponsored by: Netflix
# 824073fb	07-Mar-2017	Warner Losh <imp@FreeBSD.org>	Avoid dereferencing uninitialized elements in the error path. Some drives sometimes have errors for things like setting the number of queue entries in the submission queue. The error paths taken for these drives ensure a panic dereferencing uninitialized data. Sponsored by: Netflix
# a8a18dd5	07-Mar-2017	Warner Losh <imp@FreeBSD.org>	Make multi-namespace nvme drives more robust. Fix assumptions about name spaces in NVME driver. First, it assumes cdata.nn is the number of configured devices. However, it is the number of supported name spaces. Second, it assumes that there will never be more than 16 name spaces supported, but a certain drive I'm testing reports 1024. It assumes that name spaces are a tightly packed namespace, but the standard seems to indicate otherwise. Finally, it assumes that an error would be generated when querying an unconfigured namespace. Instead, it succeeds but the identify data is all zeros. Fix these by limiting the number of name spaces we probe to 16. Remove aborting when we find one in error. When the size of the name space is zero, ignore it. This is admittedly a band-aid. The long term fix will be to participate in the enumeration and name space change protocols defined in the NVMe standard. Sponsored by: Netflix
# a3a6c48d	02-Feb-2017	Warner Losh <imp@FreeBSD.org>	Ensure that the passthrough request will fit in MAXPHYS bytes after it has been rounded to full pages. This avoids a panic in vm_fault_quick_hold_pages due to this off-by-one error passing one page too many into vmapbuf.
# a965389b	07-Nov-2016	Scott Long <scottl@FreeBSD.org>	Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a bunch of safety belts and error handling in related codepaths. Reviewed by: jimharris Obtained from: Netflix Differential Revision: D8453
# f24c011b	10-Jun-2016	Warner Losh <imp@FreeBSD.org>	Commit the bits of nda that were missed. This should fix the build. Approved by: re@
# 361e1fb4	23-Feb-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: fix intx handler to not dereference ioq during initialization This was a regression from r293328, which deferred allocation of the controller's ioq array until after interrupts are enabled during boot. PR: 207432 Reported and tested by: Andy Carrel <wac@google.com> MFC after: 3 days Sponsored by: Intel
# 43cd6160	18-Feb-2016	Justin Hibbits <jhibbits@FreeBSD.org>	Replace several bus_alloc_resource() calls using default arguments with bus_alloc_resource_any() Since these calls only use default arguments, bus_alloc_resource_any() is the right call. Differential Revision: https://reviews.freebsd.org/D5306
# 7b036d77	11-Feb-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: avoid duplicate SET_NUM_QUEUES commands nvme(4) issues a SET_NUM_QUEUES command during device initialization to ensure enough I/O queues exist for each of the MSI-X vectors we have allocated. The SET_NUM_QUEUES command is then issued again during nvme_ctrlr_start(), to ensure that is properly set after any controller reset. At least one NVMe drive exists which fails this second SET_NUM_QUEUES command during device initialization. So change nvme_ctrlr_start() to only issue its SET_NUM_QUEUES command when it is coming out of a reset - avoiding the duplicate SET_NUM_QUEUES during device initialization. Reported by: gallatin MFC after: 3 days Sponsored by: Intel
# 9c6b5d40	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: replace NVME_CEILING macro with howmany() Suggested by: rpokala MFC after: 3 days
# 50dea2da	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: add hw.nvme.min_cpus_per_ioq tunable Due to FreeBSD system-wide limits on number of MSI-X vectors (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321), it may be desirable to allocate fewer than the maximum number of vectors for an NVMe device, in order to save vectors for other devices (usually Ethernet) that can take better advantage of them and may be probed after NVMe. This tunable is expressed in terms of minimum number of CPUs per I/O queue instead of max number of queues per controller, to allow for a more even distribution of CPUs per queue. This avoids cases where some number of CPUs have a dedicated queue, but other CPUs need to share queues. Ideally the PR referenced above will eventually be fixed and the mechanism implemented here becomes obsolete anyways. While here, fix a bug in the CPUs per I/O queue calculation to properly account for the admin queue's MSI-X vector. Reviewed by: gallatin MFC after: 3 days Sponsored by: Intel
# 2b647da7	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: do not revert to a single I/O queue when per-CPU queues not possible Previously nvme(4) would revert to a single I/O queue if it could not allocate enough interrupt vectors or NVMe submission/completion queues to have one I/O queue per core. This patch determines how to utilize a smaller number of available interrupt vectors, and assigns (as closely as possible) an equal number of cores to each associated I/O queue. MFC after: 3 days Sponsored by: Intel
# d400f790	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: break out interrupt setup code into a separate function MFC after: 3 days Sponsored by: Intel
# e5af5854	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: do not pre-allocate MSI-X IRQ resources The issue referenced here was resolved by other changes in recent commits, so this code is no longer needed. MFC after: 3 days Sponsored by: Intel
# c75ad8ce	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: remove per_cpu_io_queues from struct nvme_controller Instead just use num_io_queues to make this determination. This prepares for some future changes enabling use of multiple queues when we do not have enough queues or MSI-X vectors for one queue per CPU. MFC after: 3 days Sponsored by: Intel
# d85f84ab	07-Jan-2016	Jim Harris <jimharris@FreeBSD.org>	nvme: simplify some of the nested ifs in interrupt setup code This prepares for some follow-up commits which do more work in this area. MFC after: 3 days Sponsored by: Intel
# fade8dd7	23-Jul-2015	Jeff Roberson <jeff@FreeBSD.org>	Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers. In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division
# cbdec09c	23-Jul-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: ensure csts.rdy bit is cleared before returning from nvme_ctrlr_disable PR: 200458 MFC after: 3 days Sponsored by: Intel
# de9a58f4	23-Jul-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: properly handle case where pci_alloc_msix does not alloc all vectors Reported by: Sean Kelly <smkelly@smkelly.org> MFC after: 3 days Sponsored by: Intel
# 36b0e4ee	08-Apr-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: remove CHATHAM related code Chatham was an internal NVMe prototype board used for early driver development. MFC after: 1 week Sponsored by: Intel
# e5ce5379	08-Apr-2015	Jim Harris <jimharris@FreeBSD.org>	nvme: fall back to a smaller MSI-X vector allocation if necessary Previously, if per-CPU MSI-X vectors could not be allocated, nvme(4) would fall back to INTx with a single I/O queue pair. This change will still fall back to a single I/O queue pair, but allocate MSI-X vectors instead of reverting to INTx. MFC after: 1 week Sponsored by: Intel
# f42ca756	18-Mar-2014	Jim Harris <jimharris@FreeBSD.org>	nvme: Allocate all MSI resources up front so that we can fall back to INTx if necessary. Sponsored by: Intel MFC after: 3 days
# 496a2752	18-Mar-2014	Jim Harris <jimharris@FreeBSD.org>	nvme: Close hole where nvd(4) would not be notified of all nvme(4) instances if modules loaded during boot. Sponsored by: Intel MFC after: 3 days
# 2b26030c	17-Mar-2014	Jim Harris <jimharris@FreeBSD.org>	nvme: Remove the software progress marker SET_FEATURE command during controller initialization. The spec says OS drivers should send this command after controller initialization completes successfully, but other NVMe OS drivers are not sending this command. This change will therefore reduce differences between the FreeBSD and other OS drivers. Sponsored by: Intel MFC after: 3 days
# 448cffc8	06-Jan-2014	Jim Harris <jimharris@FreeBSD.org>	For IDENTIFY passthrough commands to Chatham prototype controllers, copy the spoofed identify data into the user buffer rather than issuing the command to the controller, since Chatham IDENTIFY data is always spoofed. While here, fix a bug in the spoofed data for Chatham submission and completion queue entry sizes. Sponsored by: Intel MFC after: 3 days
# d603c3d7	01-Nov-2013	Jim Harris <jimharris@FreeBSD.org>	Create a unique unit number for each controller and namespace cdev. Sponsored by: Intel MFC after: 3 days
# bb2f67fd	08-Oct-2013	Jim Harris <jimharris@FreeBSD.org>	Log and then disable asynchronous notification of persistent events after they occur. This prevents repeated notifications of the same event. Status of these events may be viewed at any time by viewing the SMART/Health Info Page using nvmecontrol, whether or not asynchronous events notifications for those events are enabled. This log page can be viewed using: nvmecontrol logpage -p 2 <ctrlr id> Future enhancements may re-enable these notifications on a periodic basis so that if the notified condition persists, it will continue to be logged. Sponsored by: Intel Reviewed by: carl Approved by: re (hrs) MFC after: 1 week
# d5fc9821	08-Oct-2013	Jim Harris <jimharris@FreeBSD.org>	Do not enable temperature threshold as an asynchronous event notification on NVMe controllers that do not support it. Sponsored by: Intel Reviewed by: carl Approved by: re (hrs) MFC after: 1 week
# 56183abc	13-Aug-2013	Jim Harris <jimharris@FreeBSD.org>	Send a shutdown notification in the driver unload path, to ensure notification gets sent in cases where system shuts down with driver unloaded. Sponsored by: Intel Reviewed by: carl MFC after: 3 days
# 8e0ac13f	17-Jul-2013	Jim Harris <jimharris@FreeBSD.org>	Use pause() instead of DELAY() when polling for completion of admin commands during controller initialization. DELAY() does not work here during config_intrhook context - we need to explicitly relinquish the CPU for the admin command completion to get processed. Sponsored by: Intel Reported by: Adam Brooks <adam.j.brooks@intel.com> Reviewed by: carl MFC after: 3 days
# e9efbc13	09-Jul-2013	Jim Harris <jimharris@FreeBSD.org>	Update copyright dates. MFC after: 3 days
# ec526ea9	09-Jul-2013	Jim Harris <jimharris@FreeBSD.org>	Do not retry failed async event requests. Sponsored by: Intel MFC after: 3 days
# 7b68ae1e	26-Jun-2013	Jim Harris <jimharris@FreeBSD.org>	Fail any passthrough command whose transfer size exceeds the controller's max transfer size. This guards against rogue commands coming in from userspace. Also add KASSERTS for the virtual address and unmapped bio cases, if the transfer size exceeds the controller's max transfer size. Sponsored by: Intel MFC after: 3 days
# 8d09e3c4	26-Jun-2013	Jim Harris <jimharris@FreeBSD.org>	Use MAXPHYS to specify the maximum I/O size for nvme(4). Also allow admin commands to transfer up to this maximum I/O size, rather than the artificial limit previously imposed. The larger I/O size is very beneficial for upcoming firmware download support. This has the added benefit of simplifying the code since both admin and I/O commands now use the same maximum I/O size. Sponsored by: Intel MFC after: 3 days
# 5076698e	12-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Remove the NVME_IDENTIFY_CONTROLLER and NVME_IDENTIFY_NAMESPACE IOCTLs and replace them with the NVMe passthrough equivalent. Sponsored by: Intel
# 7c3f19d7	12-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Add support for passthrough NVMe commands. This includes a new IOCTL to support a generic method for nvmecontrol(8) to pass IDENTIFY, GET_LOG_PAGE, GET_FEATURES and other commands to the controller, rather than separate IOCTLs for each. Sponsored by: Intel
# a90b8104	12-Apr-2013	Jim Harris <jimharris@FreeBSD.org>	Rename the controller's fail_req_lock, so that it can be used for other locking operations on the controller. Sponsored by: Intel
# 1e526bc4	29-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add "type" to nvme_request, signifying if its payload is a VADDR, UIO, or NULL. This simplifies decisions around if/how requests are routed through busdma. It also paves the way for supporting unmapped bios. Sponsored by: Intel
# bb852ae8	28-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Delete extra IO qpairs allocated based on number of MSI-X vectors, but later found to not be usable because the controller doesn't support the same number of queues. This is not the normal case, but does occur with the Chatham prototype board. Sponsored by: Intel
# 547d523e	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Clean up debug prints. 1) Consistently use device_printf. 2) Make dump_completion and dump_command into something more human-readable. Sponsored by: Intel Reviewed by: carl
# 237d2019	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Change a number of malloc(9) calls to use M_WAITOK instead of M_NOWAIT. Sponsored by: Intel Suggested by: carl Reviewed by: carl
# 955910a9	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Replace usages of mtx_pool_find used for admin commands with a polling mechanism. Now that all requests are timed, we are guaranteed to get a completion notification, even if it is an abort status due to a timed out admin command. This has the effect of simplifying the controller and namespace setup code, so that it reads straight through rather than broken up into a bunch of different callback functions. Sponsored by: Intel Reviewed by: carl
# 232e2edb	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add the ability to internally mark a controller as failed, if it is unable to start or reset. Also add a notifier for NVMe consumers for controller fail conditions and plumb this notifier for nvd(4) to destroy the associated GEOM disks when a failure occurs. This requires a bit of work to cover the races when a consumer is sending I/O requests to a controller that is transitioning to the failed state. To help cover this condition, add a task to defer completion of I/Os submitted to a failed controller, so that the consumer will still always receive its completions in a different context than the submission. Sponsored by: Intel Reviewed by: carl
# 3d7eb41c	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Just disable the controller instead of deleting IO queues during detach. This is just as effective, and removes the need for a bunch of admin commands to a controller that's going to be disabled shortly anyways. Sponsored by: Intel Reviewed by: carl
# 74019d4b	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Set Pre-boot Software Load Count to 0 at the end of the controller start process. The spec indicates the OS driver should use Set Features (Software Progress Marker) to set the pre-boot software load count to 0 after the OS driver has successfully been initialized. This allows pre-boot software to determine if there have been any issues with the OS loading. Sponsored by: Intel Reviewed by: carl
# be34f216	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Remove the is_started flag from struct nvme_controller. This flag was originally added to communicate to the sysctl code which oids should be built, but there are easier ways to do this. This needs to be cleaned up prior to adding new controller states - for example, controller failure. Sponsored by: Intel Reviewed by: carl
# 02e33484	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Ensure the controller's MDTS is accounted for in max_xfer_size. The controller's IDENTIFY data contains MDTS (Max Data Transfer Size) to allow the controller to specify the maximum I/O data transfer size. nvme(4) already provides a default maximum, but make sure it does not exceed what MDTS reports. Sponsored by: Intel Reviewed by: carl
# cb5b7c13	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Cap the number of retry attempts to a configurable number. This ensures that if a specific I/O repeatedly times out, we don't retry it indefinitely. The default number of retries will be 4, but can be adjusted using hw.nvme.retry_count. Sponsored by: Intel Reviewed by: carl
# 0d7e13ec	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Pass associated log page data to async event consumers, if requested. Sponsored by: Intel Reviewed by: carl
# 2868353a	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	When an asynchronous event request is completed, automatically fetch the specified log page. This satisfies the spec condition that future async events of the same type will not be sent until the associated log page is fetched. Sponsored by: Intel Reviewed by: carl
# cf81529c	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Create struct nvme_status. NVMe error log entries include status, so breaking this out into its own data structure allows it to be included in both the nvme_completion data structure as well as error log entry data structures. While here, expose nvme_completion_is_error(), and change all of the places that were explicitly looking at sc/sct bits to use this macro instead. Sponsored by: Intel Reviewed by: carl
# f37c22a3	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Make nvme_ctrlr_reset a nop if a reset is already in progress. This protects against cases where a controller crashes with multiple I/O outstanding, each timing out and requesting controller resets simultaneously. While here, remove a debugging printf from a previous commit, and add more logging around I/O that need to be resubmitted after a controller reset. Sponsored by: Intel Reviewed by: carl
# 48ce3178	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	By default, always escalate to controller reset when an I/O times out. While aborts are typically cleaner than a full controller reset, many times an I/O timeout indicates other controller-level issues where aborts may not work. NVMe drivers for other operating systems are also defaulting to controller reset rather than aborts for timed out I/O. Sponsored by: Intel Reviewed by: carl
# 94143332	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add a tunable for the I/O timeout interval. Default is still 30 seconds, but can be adjusted between a min/max of 5 and 120 seconds. Sponsored by: Intel Reviewed by: carl
# 12d191ec	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add handling for controller fatal status (csts.cfs). On any I/O timeout, check for csts.cfs==1. If set, the controller is reporting fatal status and we reset the controller immediately, rather than trying to abort the timed out command. This changeset also includes deferring the controller start portion of the reset to a separate task. This ensures we are always performing a controller start operation from a consistent context. Sponsored by: Intel Reviewed by: carl
# dbba7442	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add API for nvme consumers to access controller and namespace identify data. Sponsored by: Intel Reviewed by: carl
# b846efd7	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add controller reset capability to nvme(4) and ability to explicitly invoke it from nvmecontrol(8). Controller reset will be performed in cases where I/O are repeatedly timing out, the controller reports an unrecoverable condition, or when explicitly requested via IOCTL or an nvme consumer. Since the controller may be in such a state where it cannot even process queue deletion requests, we will perform a controller reset without trying to clean up anything on the controller first. Sponsored by: Intel Reviewed by: carl
# 038a5ee4	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Add an interface for nvme shim drivers (i.e. nvd) to register for notifications when new nvme controllers are added to the system. Sponsored by: Intel
# 0a0b08cc	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Enable asynchronous event requests on non-Chatham devices. Also add logic to clean up all outstanding asynchronous event requests when resetting or shutting down the controller, since these requests will not be explicitly completed by the controller itself. Sponsored by: Intel
# 990e741c	26-Mar-2013	Jim Harris <jimharris@FreeBSD.org>	Move controller destruction code from nvme_detach() to new nvme_ctrlr_destruct() function. Sponsored by: Intel
# 4b52061e	07-Mar-2013	David E. O'Brien <obrien@FreeBSD.org>	Fix GCC build: /usr/src/sys/modules/nvme/../../dev/nvme/nvme.c:211: warning: format '%qx' expects type 'long unsigned int', but argument 9 has type 'long long unsigned int' [-Wformat]
# 91fe20e3	18-Dec-2012	Jim Harris <jimharris@FreeBSD.org>	Map BAR 4/5, because NVMe spec says devices may place the MSI-X table behind BAR 4/5, rather than in BAR 0/1 with the control/doorbell registers. Sponsored by: Intel
# 4d6abcb1	18-Dec-2012	Jim Harris <jimharris@FreeBSD.org>	Do not use taskqueue to defer completion work when using INTx. INTx now matches MSI-X behavior. Sponsored by: Intel
# 21b6da58	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Preallocate a limited number of nvme_tracker objects per qpair, rather than dynamically creating them at runtime. Sponsored by: Intel
# 5ae9ed68	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Create nvme_qpair_submit_request() which eliminates all of the code duplication between the admin and io controller-level submit functions. Sponsored by: Intel
# c2e83b40	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Simplify how the qpair lock is acquired and released. Sponsored by: Intel
# 5fa5cc5f	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Cleanup uio-related code to use struct nvme_request and nvme_ctrlr_submit_io_request(). While here, also fix case where a uio may have more than 1 iovec. NVMe's definition of SGEs (called PRPs) only allows for the first SGE to start on a non-page boundary. The simplest way to handle this is to construct a temporary uio for each iovec, and submit an NVMe request for each. Sponsored by: Intel
# d281e8fb	17-Oct-2012	Jim Harris <jimharris@FreeBSD.org>	Add nvme_ctrlr_submit_[admin|io]_request functions which consolidate code for allocating nvme_tracker objects and making calls into bus_dmamap_load for commands which have payloads. Sponsored by: Intel
# 8a382371	18-Sep-2012	Jim Harris <jimharris@FreeBSD.org>	Add #if 0 around nvme_async_event_cb() until NVMe AER functionality can be tested. This fixes a build warning found only with clang.
# bb0ec6b3	17-Sep-2012	Jim Harris <jimharris@FreeBSD.org>	This is the first of several commits which will add NVM Express (NVMe) support to FreeBSD. A full description of the overall functionality being added is below. nvmexpress.org defines NVM Express as "an optimized register interface, command set and feature set for PCI Express (PCIe)-based Solid-State Drives (SSDs)." This commit adds nvme(4) and nvd(4) driver source code and Makefiles to the tree. Full NVMe functionality description: Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe) device support. There will continue to be ongoing work on NVM Express support, but there is more than enough to allow for evaluation of pre-production NVM Express devices as well as soliciting feedback. Questions and feedback are welcome. nvme(4) implements NVMe hardware abstraction and is a provider of NVMe namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN. nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks. nvmecontrol(8) is used for NVMe configuration and management.
The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts; see below)
- compilation switches for support back to stable-7
nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling
nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in human-readable or hex format
- perftest: quick and dirty performance test to measure raw performance of NVMe device without userspace/physio/GEOM overhead
The following are still work in progress and will be completed over the next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval
This has been primarily tested on amd64, with light testing on i386. I would be happy to provide assistance to anyone interested in porting this to other architectures, but am not currently planning to do this work myself. Big-endian and dmamap sync for command/completion queues are the main areas that would need to be addressed. The nvme(4) driver currently has references to Chatham, an Intel-developed prototype board that is not fully spec compliant. These references will all be removed over time. Sponsored by: Intel Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>