#
d09ee08f |
|
24-May-2024 |
Warner Losh <imp@FreeBSD.org> |
nvme: Count number of alignment splits When possible, we split up I/Os to NVMe drives that advertise a preferred alignment. Add a counter for this. Sponsored by: Netflix Reviewed by: chuck, mav Differential Revision: https://reviews.freebsd.org/D45311
|
#
1931b75e |
|
22-Mar-2024 |
John Baldwin <jhb@FreeBSD.org> |
nvme: Export constants for min and max queue sizes These are useful for NVMe over Fabrics. Reviewed by: imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D44441
|
#
8d6c0743 |
|
06-Nov-2023 |
Alexander Motin <mav@FreeBSD.org> |
nvme: Introduce longer timeouts for admin queue KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty namespace. In some cases like hot-plug it takes longer, triggering timeout and controller resets after just 30 seconds. Linux for many years has separate 60 seconds timeout for admin queue. This patch does the same. And it is good to be consistent. Sponsored by: iXsystems, Inc. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D42454
|
#
9cd7b624 |
|
10-Oct-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Eliminate RECOVERY_FAILED state While it seemed like a good idea to have this state, we can do everything we wanted with the state by checking ctrlr->is_failed since that's set before we start failing the qpairs. Add some comments about racing when we're failing the controller, though in practice I'm not sure that kind of race could even be lost. Sponsored by: Netflix Reviewed by: chuck, gallatin, jhb Differential Revision: https://reviews.freebsd.org/D42051
|
#
bc85cd30 |
|
10-Oct-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: gc nvme_ctrlr_post_failed_request and related task stuff In 4b977e6dda92 we removed the call to nvme_ctrlr_post_failed_request because we can now directly fail requests in this context since we're in the reset task already. No need to queue it. I left it in place against future need, but it's been two years and no panics have resulted. Since the static analysis (code checking) and the dynamic analysis (surviving in the field for 2 years, including at $WORK where we know we've gone through this path when we've failed drives) both signal that it's not really needed, go ahead and GC it. If we discover at a later date a flaw in this analysis, we can add it back easily enough by reverting this and 4b977e6dda92. Sponsored by: Netflix Reviewed by: chuck, gallatin, jhb Differential Revision: https://reviews.freebsd.org/D42048
|
#
da8324a9 |
|
24-Sep-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Fix locking protocol violation to fix suspend / resume Currently, when we suspend, we need to tear down all the qpairs. We call nvme_admin_qpair_abort_aers with the admin qpair lock held, but the tracker it will call for the pending AER also locks it (recursively) hitting an assert. This routine is called without the qpair lock held when we destroy the device entirely in a number of places. Add an assert to this effect and drop the qpair lock before calling it. nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the list, dropping it around calls to nvme_qpair_complete_tracker, and restarting the list scan after picking it back up. Note: If interrupts are still running, there's a tiny window for these AERs: If one fires just an instant after we manually complete it, then we'll be fine: we set the state of the queue to 'waiting' and we ignore interrupts while 'waiting'. We know we'll destroy all the queue state with these pending interrupts before looking at them again and we know all the TRs will have been completed or rescheduled. So either way we're covered. Also, tidy up the failure case as well: failing a queue is a superset of disabling it, so no need to call disable first. This solves some locking issues with recursion since we don't need to recurse. Set the qpair state of failed queues to RECOVERY_FAILED and stop scheduling the watchdog. Assert we're not failed when we're enabling a qpair, since failure currently is one-way. Make failure a little less verbose. Next, kill the pre/post reset stuff. It's completely bogus since we disable the qpairs, we don't need to also hold the lock through the reset: disabling will cause the ISR to return early. This keeps us from recursing on the recovery lock when resuming. We only need the recovery lock to avoid a specific race between the timer and the ISR. Finally, kill NVME_RESET_2X. It's been a major release since we put it in and nobody has used it as far as I can tell. 
And it was a motivator for the pre/post uglification. These are all interrelated, so need to be done at the same time. Sponsored by: Netflix Reviewed by: jhb Tested by: jhb (made sure suspend / resume worked) MFC After: 3 days Differential Revision: https://reviews.freebsd.org/D41866
|
#
8052b01e |
|
25-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Add exclusion for ISR Add a basically uncontended spinlock that we take out while the ISR is running. This has two effects: First, when we get a timeout, we can safely call the nvme_qpair_process_completions w/o racing any ISRs. Second, we can use it to ensure that we don't reset the card while the ISRs are active (right now we just sleep and hope for the best, which usually is fine, but not always). Sponsored by: Netflix MFC After: 2 weeks Reviewed by: chuck, gallatin Differential Revision: https://reviews.freebsd.org/D41452
|
#
d4959bfc |
|
25-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Greatly improve error recovery Next phase of error recovery: Eliminate the RECOVERY_START phase, since we don't need to wait to start recovery. Eliminate the RECOVERY_RESET phase since it is transient, we now transition from RECOVERY_NORMAL into RECOVERY_WAITING. In normal mode, read the status of the controller. If it is in failed state, or appears to be hot-plugged, jump directly to reset which will sort out the proper things to do. This will cause all pending I/O to complete with an abort status before the reset. When in the NORMAL state, call the interrupt handler. This will complete all pending transactions when interrupts are broken or temporarily misbehaving. We then check all the pending completions for timeouts. If we have abort enabled, then we'll send an abort. Otherwise we'll assume the controller is wedged and needs a reset. By calling the interrupt handler here, we'll avoid an issue with the current code where we transitioned to RECOVERY_START which prevented any completions from happening. Now completions happen. In addition, follow-on I/O that is scheduled in the completion routines will be submitted, rather than queued, because the recovery state is correct. This also fixes a problem where I/O would timeout, but never complete, leading to hung I/O. Resetting remains the same as before, just when we choose to reset has changed. A nice side effect of these changes is that we now do I/O when interrupts to the card are totally broken. Follow-on commits will improve the error reporting and logging when this happens. Performance will be awful, but will at least be minimally functional. There is a small race when we're checking the completions if interrupts are working, but this is handled in a future commit. Sponsored by: Netflix MFC After: 2 weeks Differential Revision: https://reviews.freebsd.org/D36922
|
#
95ee2897 |
|
16-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
sys: Remove $FreeBSD$: two-line .h pattern Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
|
#
33469f10 |
|
14-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: use mtx_padalign instead of mtx + alignment attribute The nvme driver predates, it seems, mtx_padalign. Modernize. Sponsored by: Netflix
|
#
09c20a29 |
|
08-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Move bools to fill hole The two bools in nvme_request create a 6 byte hole today. Move them to after retries to fill the 4 byte hole there and add a spare[2] to make nvme_request 8 bytes smaller. spare[2] isn't strictly necessary, but documents how many bytes we have left in that hole, as the number of booleans will increase shortly. Suggested by: chuck Sponsored by: Netflix
|
#
7be0b068 |
|
07-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Remove duplicate command printing routine Both nvme_dump_command and nvme_qpair_print_command print nvme commands. The latter is better. Recode the one call to nvme_dump_command to use nvme_qpair_print_command and delete the former. No sense having two nearly identical routines. A future commit will convert to sbuf. Sponsored by: Netflix Reviewed by: chuck, mav, jhb Differential Revision: https://reviews.freebsd.org/D41309
|
#
6f76d493 |
|
07-Aug-2023 |
Warner Losh <imp@FreeBSD.org> |
nvme: Remove duplicate completion printing routine Both nvme_dump_completion and nvme_qpair_print_completion print completions. The latter is better. Recode the two instances of nvme_dump_completion to use nvme_qpair_print_completion and delete the former. No sense having two nearly identical routines. A future commit will convert this to sbuf. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D41308
|
#
92103adb |
|
24-Jul-2023 |
John Baldwin <jhb@FreeBSD.org> |
nvme: Use a memdesc for the request buffer instead of a bespoke union. This avoids encoding CAM-specific knowledge in nvme_qpair.c. Reviewed by: chuck, imp, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D41119
|
#
4d846d26 |
|
10-May-2023 |
Warner Losh <imp@FreeBSD.org> |
spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch up to that fact and revert to their recommended match of BSD-2-Clause. Discussed with: pfg MFC After: 3 days Sponsored by: Netflix
|
#
1093caa1 |
|
06-May-2022 |
John Baldwin <jhb@FreeBSD.org> |
nvme: Remove unused devclass arguments to DRIVER_MODULE.
|
#
3a468f20 |
|
15-Apr-2022 |
Warner Losh <imp@FreeBSD.org> |
nvme: Use saved mps when initializing drive Make sure we set the MPS we cached (currently the drives minimum mps) in CC (Controller Configuration) when reinitializing the drive. It must match the page_size that we're going to use. Also retire less specific NVME_PAGE_SHIFT since it's now unused. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34869
|
#
55412ef9 |
|
15-Apr-2022 |
Warner Losh <imp@FreeBSD.org> |
nvme: Rename min_page_size to page_size and save mps The Memory Page Size sets the basic unit of operation for the drive. We currently set this to the drive's minimum page size, but we could set it to any page size the drive supports in the future. Replace min_page_size (it's now unused for that purpose) with page_size to reflect this and cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size. Sponsored by: Netflix Reviewed by: chuck Differential Revision: https://reviews.freebsd.org/D34868
|
#
6af6a52e |
|
29-Mar-2022 |
Warner Losh <imp@FreeBSD.org> |
nvme: Save cap_lo and cap_hi Save the capabilities for the drive. Sponsored by: Netflix
|
#
a70b5660 |
|
29-Mar-2022 |
Warner Losh <imp@FreeBSD.org> |
nvme: MPS is a power of two, not a size / 8k Setting MPS in the CC should be a power of 2 number (it specifies the page size of the host is 2^(12+MPS)), so adjust the calculation. There is no functional change because we do not support any architectures with != 4k pages (yet). Other changes are needed for architectures with 16k or 64k pages, especially when the underlying NVMe drive doesn't support that page size (Most drives support a range that's small, and many only support 4k), but let's at least do this calculation correctly. 12 - 12 is just as much 0 as 4096 >> 13 is :) Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D34707
|
#
7cf8d63c |
|
06-Dec-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme_ahci: Mark AHCI devices as such in the controller Add a quirk to flag AHCI attachment to the controller. This is for any of the strategies for attaching nvme devices as children of the AHCI device for Intel's RAID devices. This also has a side effect of cleaning up resource allocation from failed nvme_attach calls now. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D33285
|
#
053f8ed6 |
|
06-Dec-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: Move to a quirk for the Intel alignment data Prior to NVMe 1.3, Intel produced a series of drives that had performance alignment data in the vendor specific space since no standard had been defined. Move testing the versions to a quirk so the NVMe NS code doesn't know about PCI device info. Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D33284
|
#
83581511 |
|
01-Oct-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: Use adaptive spinning when polling for completion or state change We only use nvme_completion_poll in the initialization path. The commands they queue and wait for finish quickly as they involve no I/O to the drive's media. These commands take about 20-200 microseconds each. Set the wait time to 1us and then increase it by 1.5 each successive iteration (max 1ms). This reduces initialization time by 80ms in cperciva's tests. Use this same technique waiting for RDY state transitions. This saves another 20ms. In total we're down from ~330ms to ~2ms. Tested by: cperciva Sponsored by: Netflix Reviewed by: mav Differential Revision: https://reviews.freebsd.org/D32259
|
#
587aa255 |
|
28-Sep-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: count number of ignored interrupts Count the number of times we're asked to process completions, but that we ignore because the state of the qpair isn't in RECOVERY_NONE. Sponsored by: Netflix Reviewed by: mav, chuck Differential Revision: https://reviews.freebsd.org/D32212
|
#
502dc84a |
|
23-Sep-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: Use shared timeout rather than timeout per transaction Keep track of the approximate time commands are 'due' and the next deadline for a command. Twice a second, wake up to see if any commands have entered timeout. If so, quiesce and then enter a recovery mode half the timeout further in the future to allow the ISR to complete. Once we exit recovery mode, we go back to operations as normal. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28583
|
#
e3bdf3da |
|
31-Aug-2021 |
Alexander Motin <mav@FreeBSD.org> |
nvme(4): Add MSI and single MSI-X support. If we can't allocate more MSI-X vectors, accept using a single shared one. If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but accept a single shared one. If still no luck, fall back to shared INTx. This provides maximal flexibility in some limited scenarios. For example, vmd(4) does not support INTx and can handle only a limited number of MSI/MSI-X vectors without sharing. MFC after: 1 week
|
#
dd2516fc |
|
08-Feb-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: Make nvme_ctrlr_hw_reset static nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so make it static. If we need to change this in the future we can.
|
#
9600aa31 |
|
08-Feb-2021 |
Warner Losh <imp@FreeBSD.org> |
nvme: use NVME_GONE rather than hard-coded 0xffffffff Make it clearer that the value 0xffffffff is being used to detect the device is gone. We use it in other places in the driver for other meanings.
|
#
ac90f70d |
|
28-Nov-2020 |
Alexander Motin <mav@FreeBSD.org> |
Increase nvme(4) maximum transfer size from 1MB to 2MB. With 4KB page size the 2MB is the maximum we can address with a one page PRP. Going further would require chaining, which would add some more complexity. On the other hand, to reduce memory consumption, allocate the PRP memory respecting the maximum transfer size reported in the controller identify data. Many NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for the full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Reviewed by: imp MFC after: 2 weeks Sponsored by: iXsystems, Inc.
|
#
91387707 |
|
23-Nov-2020 |
Michal Meloun <mmel@FreeBSD.org> |
Ensure in nvme_single_map() that the buffer is mapped to a single segment. Not a functional change. MFC after: 1 week
|
#
71460dfc |
|
05-Nov-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
nvme: change namei_request_zone into a malloc type Both the size (128 bytes) and ephemeral nature of allocations make it a great fit for malloc. A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D27103
|
#
d87b31e1 |
|
01-Sep-2020 |
Mateusz Guzik <mjg@FreeBSD.org> |
nvme: clean up empty lines in .c and .h files
|
#
ead7e103 |
|
18-Jun-2020 |
Alexander Motin <mav@FreeBSD.org> |
Make polled request timeout less invasive. Instead of panicking after one second of polling, let the normal timeout handler activate, reset the controller and abort the outstanding requests. If all of that doesn't happen within 10 seconds then something in the driver is likely stuck badly and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. MFC after: 1 week Sponsored by: iXsystems, Inc.
|
#
67abaee9 |
|
07-Jan-2020 |
Alexander Motin <mav@FreeBSD.org> |
Add Host Memory Buffer support to nvme(4). This allows the cheapest DRAM-less NVMe SSDs to use some of the host RAM (about 1MB per 1GB on the devices I have) for their metadata cache, significantly improving random I/O performance. The device reports the minimum and preferred size of the buffer. The code limits it to 1% of physical RAM by default. If the buffer can not be allocated or is below the minimum size, the device will just have to work without it. MFC after: 2 weeks Relnotes: yes Sponsored by: iXsystems, Inc.
|
#
7588c6cc |
|
13-Dec-2019 |
Warner Losh <imp@FreeBSD.org> |
Move to using bool instead of boolean_t While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999
|
#
1eab19cb |
|
23-Sep-2019 |
Alexander Motin <mav@FreeBSD.org> |
Make nvme(4) driver some more NUMA aware. - For each queue pair precalculate CPU and domain it is bound to. If queue pairs are not per-CPU, then use the domain of the device. - Allocate most of queue pair memory from the domain it is bound to. - Bind callouts to the same CPUs as queue pair to avoid migrations. - Do not assign queue pairs to each SMT thread. It just wasted resources and increased lock congestion. - Remove fixed multiplier of CPUs per queue pair, spread them evenly. This allows using more queue pairs in some hardware configurations. - If queue pair serves multiple CPUs, bind different NVMe devices to different CPUs. MFC after: 1 month Sponsored by: iXsystems, Inc.
|
#
f93b7f95 |
|
04-Sep-2019 |
Warner Losh <imp@FreeBSD.org> |
Support doorbell strides != 0. The NVMe standard (1.4) states >>> 8.6 Doorbell Stride for Software Emulation >>> The doorbell stride,...is useful in software emulation of an NVM >>> Express controller. ... For hardware implementations of the NVM >>> Express interface, the expected doorbell stride value is 0h. However, hardware in the wild exists with a doorbell stride of 1 (meaning 8 byte separation). This change supports that hardware, as well as software emulators as envisioned in Section 8.6. Since this is the fast path, care has been taken to make this computation efficient. The bit of math to compute an offset for each is replaced by a memory load from cache of a pre-computed value. MFC After: 3 days Reviewed by: scottl@ Differential Revision: https://reviews.freebsd.org/D21514
|
#
4d547561 |
|
03-Sep-2019 |
Warner Losh <imp@FreeBSD.org> |
Implement nvme suspend / resume for pci attachment When we suspend, we need to properly shutdown the NVME controller. The controller may go into D3 state (or may have the power removed), and to properly flush the metadata to non-volatile RAM, we must complete a normal shutdown. This consists of deleting the I/O queues and setting the shutdown bit. We have to do some extra stuff to make sure we reset the software state of the queues as well. On resume, we have to reset the card twice, for reasons described in the attach function. Once we've done that, we can restart the card. If any of this fails, we'll fail the NVMe card, just like we do when a reset fails. Set is_resetting for the duration of the suspend / resume. This keeps the reset taskqueue from running a concurrent reset, and also is needed to prevent any hw completions from queueing more I/O to the card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get that from the global state of the ctrlr. Wait for any pending reset to finish. All queued I/O will get sent to the hardware as part of nvme_ctrlr_start(), though the upper layers shouldn't send any down. Disabling the qpairs is the other failsafe to ensure all I/O is queued. Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid confusion with all the other destroy functions. It just removes the queues in hardware, while the other _destroy_ functions tear down driver data structures. Split parts of the hardware reset function up so that I can do part of the reset in suspend. Split out the software disabling of the qpairs into nvme_ctrlr_disable_qpairs. Finally, fix a couple of spelling errors in comments related to this. Relnotes: Yes MFC After: 1 week Reviewed by: scottl@ (prior version) Differential Revision: https://reviews.freebsd.org/D21493
|
#
31b11bb3 |
|
02-Sep-2019 |
Warner Losh <imp@FreeBSD.org> |
In nvme_completion_poll, add a sanity check to make sure that we complete the polling within a second. Panic if we don't. All the commands that use this interface should typically complete within a few tens to hundreds of microseconds. Panic rather than return ETIMEDOUT because if the command somehow does later complete, it will randomly corrupt memory. Also, it helps to get a traceback from where the unexpected failure happens, rather than an infinite loop.
|
#
ab0681aa |
|
02-Sep-2019 |
Warner Losh <imp@FreeBSD.org> |
In all the places that we use the polled for completion interface, except crash dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler chooses to not inline, any performance hit is tiny.
|
#
f182f928 |
|
21-Aug-2019 |
Warner Losh <imp@FreeBSD.org> |
Separate the pci attachment from the rest of nvme NVMe drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF. Submitted by: cognet@
|
#
97be8b96 |
|
14-Aug-2019 |
Alexander Motin <mav@FreeBSD.org> |
Report NOIOB and NPWG fields as stripe size. Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace Preferred Write Granularity added in 1.4 allow upper layers to align I/Os for improved SSD performance and endurance. I don't have hardware reporting those yet, but NPWG could probably be reported by bhyve. MFC after: 2 weeks Sponsored by: iXsystems, Inc.
|
#
5e83c2ff |
|
19-Jul-2019 |
Warner Losh <imp@FreeBSD.org> |
Keep track of the number of commands that exhaust their retry limit. While we print failure messages on the console, sometimes logs are lost or overwhelmed. Keeping a count of how many times we've failed retriable commands helps gauge the magnitude of the problem.
|
#
c37fc318 |
|
19-Jul-2019 |
Warner Losh <imp@FreeBSD.org> |
Keep track of the number of retried commands. Retried commands can indicate a performance degradation of an nvme drive. Keep track of the number of retries and report it out via sysctl, just like the number of commands and interrupts.
|
#
1071b50a |
|
18-Jul-2019 |
Warner Losh <imp@FreeBSD.org> |
Use sysctl + CTLFLAG_RWTUN for hw.nvme.verbose_cmd_dump. Also convert it to a bool. While the rest of the driver isn't yet bool clean, this will help. Reviewed by: cem@ Differential Revision: https://reviews.freebsd.org/D20988
|
#
c75bdc04 |
|
18-Jul-2019 |
Warner Losh <imp@FreeBSD.org> |
Provide new tunable hw.nvme.verbose_cmd_dump The nvme driver dumps only the most relevant details about a command when it fails. However, there are times this is not sufficient (such as debugging weird issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1 in loader.conf will enable more complete debugging information about each command that fails. Reviewed by: rpokala Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20988
|
#
2ffd6fce |
|
08-Mar-2019 |
Warner Losh <imp@FreeBSD.org> |
Don't print all the I/O we abort on a reset, unless we're out of retries. When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'. All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset. Reviewed by: jimharris@, chuck@ Differential Revision: https://reviews.freebsd.org/D19431
|
#
45d7e233 |
|
27-Feb-2019 |
Warner Losh <imp@FreeBSD.org> |
Unconditionally support unmapped BIOs. This was another shim for supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.
|
#
d706306d |
|
27-Feb-2019 |
Warner Losh <imp@FreeBSD.org> |
Remove #ifdef code to support FreeBSD versions that haven't been supported in years. A number of changes have been made to the driver that likely wouldn't work on those older versions that aren't properly ifdef'd and it's project policy to GC such code once it is stale.
|
#
09efa3df |
|
26-Oct-2018 |
Warner Losh <imp@FreeBSD.org> |
Put a workaround in for command timeout malfunctioning At least one NVMe drive has a bug that makes the Command Time Out PCIe feature unreliable. The workaround is to disable this feature. The driver wouldn't deal correctly with a timeout anyway. Only do this for drives that are known bad. Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D17708
|
#
f439e3a4 |
|
24-May-2018 |
Alexander Motin <mav@FreeBSD.org> |
Refactor NVMe CAM integration. - Remove layering violation, when NVMe SIM code accessed CAM internal device structures to set pointers on controller and namespace data. Instead make NVMe XPT probe fetch the data directly from hardware. - Cleanup NVMe SIM code, fixing support for multiple namespaces per controller (reporting them as LUNs) and adding controller detach support and run-time namespace change notifications. - Add initial support for namespace change async events. So far only in CAM mode, but it allows run-time namespace arrival and departure. - Add missing nvme_notify_fail_consumers() call on controller detach. Together with previous changes this allows NVMe device detach/unplug. Non-CAM mode still requires a lot of love to stay on par, but at least CAM mode code should not stay in the way so much, becoming much more self-sufficient. Reviewed by: imp MFC after: 1 month Sponsored by: iXsystems, Inc.
|
#
d85d9648 |
|
15-Mar-2018 |
Warner Losh <imp@FreeBSD.org> |
Try polling the qpairs on timeout. On some systems, we're getting timeouts when we use multiple queues on drives that work perfectly well on other systems. On a hunch, Jim Harris suggested I poll the completion queue when we get a timeout. This patch polls the completion queue if no fatal status was indicated. If it had pending I/O, we complete that request and return. Otherwise, if aborts are enabled and no fatal status, we abort the command and return. Otherwise we reset the card. This may clear up the problem, or we may see it result in lots of timeouts and a performance problem. Either way, we'll know the next step. We may also need to pay attention to the fatal status bit of the controller. PR: 211713 Suggested by: Jim Harris Sponsored by: Netflix
|
#
0d787e9b |
|
22-Feb-2018 |
Wojciech Macek <wma@FreeBSD.org> |
NVMe: Add big-endian support Remove bitfields from defined structures as they are not portable. Instead use shift and mask macros in the driver and nvmecontrol application. NVMe is now working on powerpc64 host. Submitted by: Michal Stanek <mst@semihalf.com> Obtained from: Semihalf Reviewed by: imp, wma Sponsored by: IBM, QCM Technologies Differential revision: https://reviews.freebsd.org/D13916
|
#
29077eb4 |
|
28-Jan-2018 |
Warner Losh <imp@FreeBSD.org> |
Use atomic load and stores to ensure that the compiler doesn't optimize away these loops. Change boolean to int to match what the atomic API supplies. Remove wmb() since the atomic_store_rel() on status.done ensures the prior writes to status are complete. It also fixes the fact that there wasn't a rmb() before reading done. This should also be more efficient since wmb() is fairly heavy weight. Sponsored by: Netflix Reviewed by: kib@, jim harris Differential Revision: https://reviews.freebsd.org/D14053
|
#
ce1ec9c1 |
|
18-Dec-2017 |
Warner Losh <imp@FreeBSD.org> |
When we're disabling the nvme device, some drives have a controller bug that requires 'hands off' for a period of time (2.3s) before we check the RDY bit. Since this is a very odd quirk for a very limited selection of drives, do this as a quirk. This prevented a successful reset of the card when the card wedged. Also, make sure that we comply with the advice from section 3.1.5 of the 1.3 spec, which says that transitioning CC.EN from 0 to 1 when CSTS.RDY is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has undefined results". Short circuit when EN == RDY == desired state. Finally, fail the reset if the disable fails. This will lead to a failed device, which is what we want. (note: nda device needs work for coping with a failed device). Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D13389
|
#
718cf2cc |
|
27-Nov-2017 |
Pedro F. Giffuni <pfg@FreeBSD.org> |
sys/dev: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts.
|
#
bb1c7be4 |
|
15-Oct-2017 |
Warner Losh <imp@FreeBSD.org> |
Create general polling function for the nvme controller. Use it when we're doing the various pin-based interrupt modes. Adjust nvme_ctrlr_intx_handler to use nvme_ctrlr_poll. Sponsored by: Netflix Suggested by: scottl@
|
#
51977281 |
|
29-Aug-2017 |
Warner Losh <imp@FreeBSD.org> |
Add CAM/NVMe support for CAM_DATA_SG This adds support in pass(4) for data to be described with a scatter-gather list (sglist) to augment the existing (single) virtual address. Differential Revision: https://reviews.freebsd.org/D11361 Submitted by: Chuck Tuffli Reviewed by: imp@, scottl@, kenm@
|
#
c02565f9 |
|
28-Aug-2017 |
Warner Losh <imp@FreeBSD.org> |
Set the max transactions for NVMe drives better. Provided a better estimate for the number of transactions that can be pending at one time. This will be number of queues * number of trackers / 4, as suggested by Jim Harris. This gives a better estimate of the number of transactions that CAM should queue before applying back pressure. This should be revisited when we have real multi-queue support in CAM and the upper layers of the I/O stack. Sponsored by: Netflix
|
#
696c9502 |
|
25-Aug-2017 |
Warner Losh <imp@FreeBSD.org> |
NVME Namespace ID is 32-bits, so widen interface to reflect that. Sponsored by: Netflix
|
#
a965389b |
|
07-Nov-2016 |
Scott Long <scottl@FreeBSD.org> |
Convert the Q-Pair and PRP list memory allocations to use BUSDMA. Add a bunch of safety belts and error handling in related codepaths. Reviewed by: jimharris Obtained from: Netflix Differential Revision: D8453
|
#
3a31c31c |
|
20-Jul-2016 |
Warner Losh <imp@FreeBSD.org> |
Actually import nvme_sim so the CAM attachment for NVME (nda) actually works. MFC after: 1 week
|
#
f24c011b |
|
10-Jun-2016 |
Warner Losh <imp@FreeBSD.org> |
Commit the bits of nda that were missed. This should fix the build. Approved by: re@
|
#
2b647da7 |
|
07-Jan-2016 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: do not revert to a single I/O queue when per-CPU queues not possible Previously nvme(4) would revert to a single I/O queue if it could not allocate enough interrupt vectors or NVMe submission/completion queues to have one I/O queue per core. This patch determines how to utilize a smaller number of available interrupt vectors, and assigns (as closely as possible) an equal number of cores to each associated I/O queue. MFC after: 3 days Sponsored by: Intel
|
#
e5af5854 |
|
07-Jan-2016 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: do not pre-allocate MSI-X IRQ resources The issue referenced here was resolved by other changes in recent commits, so this code is no longer needed. MFC after: 3 days Sponsored by: Intel
|
#
c75ad8ce |
|
07-Jan-2016 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: remove per_cpu_io_queues from struct nvme_controller Instead just use num_io_queues to make this determination. This prepares for some future changes enabling use of multiple queues when we do not have enough queues or MSI-X vectors for one queue per CPU. MFC after: 3 days Sponsored by: Intel
|
#
36b0e4ee |
|
08-Apr-2015 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: remove CHATHAM related code Chatham was an internal NVMe prototype board used for early driver development. MFC after: 1 week Sponsored by: Intel
|
#
a6e30963 |
|
08-Apr-2015 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: create separate DMA tag for non-payload DMA buffers Submission and completion queue memory need to use a separate DMA tag for mappings than payload buffers, to ensure mappings remain contiguous even with DMAR enabled. Submitted by: kib MFC after: 1 week Sponsored by: Intel
|
#
f42ca756 |
|
18-Mar-2014 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: Allocate all MSI resources up front so that we can fall back to INTx if necessary. Sponsored by: Intel MFC after: 3 days
|
#
496a2752 |
|
18-Mar-2014 |
Jim Harris <jimharris@FreeBSD.org> |
nvme: Close hole where nvd(4) would not be notified of all nvme(4) instances if modules loaded during boot. Sponsored by: Intel MFC after: 3 days
|
#
bb2f67fd |
|
08-Oct-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Log and then disable asynchronous notification of persistent events after they occur. This prevents repeated notifications of the same event. Status of these events may be viewed at any time by viewing the SMART/Health Info Page using nvmecontrol, whether or not asynchronous event notifications for those events are enabled. This log page can be viewed using: nvmecontrol logpage -p 2 <ctrlr id> Future enhancements may re-enable these notifications on a periodic basis so that if the notified condition persists, it will continue to be logged. Sponsored by: Intel Reviewed by: carl Approved by: re (hrs) MFC after: 1 week
|
#
a40e72a6 |
|
08-Oct-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add driver-assisted striping for upcoming Intel NVMe controllers that can benefit from it. Sponsored by: Intel Reviewed by: kib (earlier version), carl Approved by: re (hrs) MFC after: 1 week
|
#
56183abc |
|
13-Aug-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Send a shutdown notification in the driver unload path, to ensure notification gets sent in cases where system shuts down with driver unloaded. Sponsored by: Intel Reviewed by: carl MFC after: 3 days
|
#
bd6b0ac5 |
|
09-Jul-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add comment explaining why CACHE_LINE_SIZE is defined in nvme_private.h if not already defined elsewhere. Requested by: attilio MFC after: 3 days
|
#
e9efbc13 |
|
09-Jul-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Update copyright dates. MFC after: 3 days
|
#
bbd412dd |
|
26-Jun-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Remove remaining uio-related code. The nvme_physio() function was removed quite a while ago, which was the only user of this uio-related code. Sponsored by: Intel MFC after: 3 days
|
#
8d09e3c4 |
|
26-Jun-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Use MAXPHYS to specify the maximum I/O size for nvme(4). Also allow admin commands to transfer up to this maximum I/O size, rather than the artificial limit previously imposed. The larger I/O size is very beneficial for upcoming firmware download support. This has the added benefit of simplifying the code since both admin and I/O commands now use the same maximum I/O size. Sponsored by: Intel MFC after: 3 days
|
#
ca269f32 |
|
12-Apr-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Move the busdma mapping functions to nvme_qpair.c. This removes nvme_uio.c completely. Sponsored by: Intel
|
#
97fafe25 |
|
12-Apr-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add a mutex to each namespace, for general locking operations on the namespace. Sponsored by: Intel
|
#
a90b8104 |
|
12-Apr-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Rename the controller's fail_req_lock, so that it can be used for other locking operations on the controller. Sponsored by: Intel
|
#
5fdf9c3c |
|
01-Apr-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add unmapped bio support to nvme(4) and nvd(4). Sponsored by: Intel
|
#
1e526bc4 |
|
29-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add "type" to nvme_request, signifying whether its payload is a VADDR, UIO, or NULL. This simplifies decisions around if/how requests are routed through busdma. It also paves the way for supporting unmapped bios. Sponsored by: Intel
|
#
547d523e |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Clean up debug prints. 1) Consistently use device_printf. 2) Make dump_completion and dump_command into something more human-readable. Sponsored by: Intel Reviewed by: carl
|
#
dd433dd0 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Move common code from the different nvme_allocate_request functions into a separate function. Sponsored by: Intel Suggested by: carl Reviewed by: carl
|
#
955910a9 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Replace usages of mtx_pool_find used for admin commands with a polling mechanism. Now that all requests are timed, we are guaranteed to get a completion notification, even if it is an abort status due to a timed out admin command. This has the effect of simplifying the controller and namespace setup code, so that it reads straight through rather than broken up into a bunch of different callback functions. Sponsored by: Intel Reviewed by: carl
|
#
232e2edb |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add the ability to internally mark a controller as failed, if it is unable to start or reset. Also add a notifier for NVMe consumers for controller fail conditions and plumb this notifier for nvd(4) to destroy the associated GEOM disks when a failure occurs. This requires a bit of work to cover the races when a consumer is sending I/O requests to a controller that is transitioning to the failed state. To help cover this condition, add a task to defer completion of I/Os submitted to a failed controller, so that the consumer will still always receive its completions in a different context than the submission. Sponsored by: Intel Reviewed by: carl
|
#
be34f216 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Remove the is_started flag from struct nvme_controller. This flag was originally added to communicate to the sysctl code which oids should be built, but there are easier ways to do this. This needs to be cleaned up prior to adding new controller states - for example, controller failure. Sponsored by: Intel Reviewed by: carl
|
#
02e33484 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Ensure the controller's MDTS is accounted for in max_xfer_size. The controller's IDENTIFY data contains MDTS (Max Data Transfer Size) to allow the controller to specify the maximum I/O data transfer size. nvme(4) already provides a default maximum, but make sure it does not exceed what MDTS reports. Sponsored by: Intel Reviewed by: carl
|
#
cb5b7c13 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Cap the number of retry attempts to a configurable number. This ensures that if a specific I/O repeatedly times out, we don't retry it indefinitely. The default number of retries will be 4, but is adjusted using hw.nvme.retry_count. Sponsored by: Intel Reviewed by: carl
|
#
0d7e13ec |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Pass associated log page data to async event consumers, if requested. Sponsored by: Intel Reviewed by: carl
|
#
2868353a |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
When an asynchronous event request is completed, automatically fetch the specified log page. This satisfies the spec condition that future async events of the same type will not be sent until the associated log page is fetched. Sponsored by: Intel Reviewed by: carl
|
#
0692579b |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add structure definitions and controller command function for firmware log pages. Sponsored by: Intel Reviewed by: carl
|
#
08927782 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add structure definitions and a controller command function for error log pages. Sponsored by: Intel Reviewed by: carl
|
#
f37c22a3 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Make nvme_ctrlr_reset a nop if a reset is already in progress. This protects against cases where a controller crashes with multiple I/Os outstanding, each timing out and requesting controller resets simultaneously. While here, remove a debugging printf from a previous commit, and add more logging around I/Os that need to be resubmitted after a controller reset. Sponsored by: Intel Reviewed by: carl
|
#
48ce3178 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
By default, always escalate to controller reset when an I/O times out. While aborts are typically cleaner than a full controller reset, many times an I/O timeout indicates other controller-level issues where aborts may not work. NVMe drivers for other operating systems are also defaulting to controller reset rather than aborts for timed out I/O. Sponsored by: Intel Reviewed by: carl
|
#
94143332 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add a tunable for the I/O timeout interval. Default is still 30 seconds, but can be adjusted between a min/max of 5 and 120 seconds. Sponsored by: Intel Reviewed by: carl
|
#
12d191ec |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add handling for controller fatal status (csts.cfs). On any I/O timeout, check for csts.cfs==1. If set, the controller is reporting fatal status and we reset the controller immediately, rather than trying to abort the timed out command. This changeset also includes deferring the controller start portion of the reset to a separate task. This ensures we are always performing a controller start operation from a consistent context. Sponsored by: Intel Reviewed by: carl
|
#
b846efd7 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add controller reset capability to nvme(4) and ability to explicitly invoke it from nvmecontrol(8). Controller reset will be performed in cases where I/O are repeatedly timing out, the controller reports an unrecoverable condition, or when explicitly requested via IOCTL or an nvme consumer. Since the controller may be in such a state where it cannot even process queue deletion requests, we will perform a controller reset without trying to clean up anything on the controller first. Sponsored by: Intel Reviewed by: carl
|
#
65c2474e |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Keep a doubly-linked list of outstanding trackers. This enables in-order re-submission of I/O after a controller reset. Sponsored by: Intel
|
#
99d99f74 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Expose the get/set features API to nvme consumers. Sponsored by: Intel
|
#
038a5ee4 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add an interface for nvme shim drivers (i.e. nvd) to register for notifications when new nvme controllers are added to the system. Sponsored by: Intel
|
#
0a0b08cc |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Enable asynchronous event requests on non-Chatham devices. Also add logic to clean up all outstanding asynchronous event requests when resetting or shutting down the controller, since these requests will not be explicitly completed by the controller itself. Sponsored by: Intel
|
#
990e741c |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Move controller destruction code from nvme_detach() to new nvme_ctrlr_destruct() function. Sponsored by: Intel
|
#
274b3a88 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Specify command timeout interval on a per-command type basis. This is primarily driven by the need to disable timeouts for asynchronous event requests, which by nature should not be timed out. Sponsored by: Intel
|
#
448195e7 |
|
26-Mar-2013 |
Jim Harris <jimharris@FreeBSD.org> |
Add support for ABORT commands, including issuing these commands when an I/O times out. Also ensure that we retry commands that are aborted due to a timeout. Sponsored by: Intel
|
#
91fe20e3 |
|
18-Dec-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Map BAR 4/5, because NVMe spec says devices may place the MSI-X table behind BAR 4/5, rather than in BAR 0/1 with the control/doorbell registers. Sponsored by: Intel
|
#
4d6abcb1 |
|
18-Dec-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Do not use taskqueue to defer completion work when using INTx. INTx now matches MSI-X behavior. Sponsored by: Intel
|
#
38ce9496 |
|
06-Dec-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Add PCI device ID for 8-channel IDT NVMe controller, and clarify that the previously defined IDT PCI device ID was for a 32-channel controller. Submitted by: Joe Golio <joseph.golio@isilon.com>
|
#
0f71ecf7 |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Add ability to queue nvme_request objects if no nvme_trackers are available. This eliminates the need to manage queue depth at the nvd(4) level for Chatham prototype board workarounds, and also adds the ability to accept a number of requests on a single qpair that is much larger than the number of trackers allocated. Sponsored by: Intel
|
#
21b6da58 |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Preallocate a limited number of nvme_tracker objects per qpair, rather than dynamically creating them at runtime. Sponsored by: Intel
|
#
5ae9ed68 |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Create nvme_qpair_submit_request() which eliminates all of the code duplication between the admin and io controller-level submit functions. Sponsored by: Intel
|
#
5fa5cc5f |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Cleanup uio-related code to use struct nvme_request and nvme_ctrlr_submit_io_request(). While here, also fix case where a uio may have more than 1 iovec. NVMe's definition of SGEs (called PRPs) only allows for the first SGE to start on a non-page boundary. The simplest way to handle this is to construct a temporary uio for each iovec, and submit an NVMe request for each. Sponsored by: Intel
|
#
d281e8fb |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Add nvme_ctrlr_submit_[admin|io]_request functions which consolidates code for allocating nvme_tracker objects and making calls into bus_dmamap_load for commands which have payloads. Sponsored by: Intel
|
#
ad697276 |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Add struct nvme_request object which contains all of the parameters passed from an NVMe consumer. This allows us to mostly build NVMe command buffers without holding the qpair lock, and also allows for future queueing of nvme_request objects in cases where the submission queue is full and no nvme_tracker objects are available. Sponsored by: Intel
|
#
f2b19f67 |
|
17-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Merge struct nvme_prp_list into struct nvme_tracker. This simplifies the driver significantly where it is constructing commands to be submitted to hardware. By reducing the number of PRPs (NVMe parlance for SGE) from 128 to 32, it ensures we do not allocate too much memory for more common smaller I/O sizes, while still supporting up to 128KB I/O sizes. This also paves the way for pre-allocation of nvme_tracker objects for each queue which will simplify the I/O path even further. Sponsored by: Intel
|
#
6568ebfc |
|
10-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Count number of times each queue pair's interrupt handler is invoked. Also add sysctls to query and reset each queue pair's stats, including the new count added here. Sponsored by: Intel
|
#
8bed48f2 |
|
10-Oct-2012 |
Jim Harris <jimharris@FreeBSD.org> |
Put the nvme_qpair mutex on its own cacheline. Sponsored by: Intel
|
#
bb0ec6b3 |
|
17-Sep-2012 |
Jim Harris <jimharris@FreeBSD.org> |
This is the first of several commits which will add NVM Express (NVMe) support to FreeBSD. A full description of the overall functionality being added is below. nvmexpress.org defines NVM Express as "an optimized register interface, command set and feature set for PCI Express (PCIe)-based Solid-State Drives (SSDs)." This commit adds nvme(4) and nvd(4) driver source code and Makefiles to the tree. Full NVMe functionality description: Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe) device support. There will continue to be ongoing work on NVM Express support, but there is more than enough to allow for evaluation of pre-production NVM Express devices as well as soliciting feedback. Questions and feedback are welcome. nvme(4) implements NVMe hardware abstraction and is a provider of NVMe namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN. nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks. nvmecontrol(8) is used for NVMe configuration and management.
The following are currently supported: nvme(4) - full mandatory NVM command set support - per-CPU IO queues (enabled by default but configurable) - per-queue sysctls for statistics and full command/completion queue dumps for debugging - registration API for NVMe namespace consumers - I/O error handling (except for timeouts; see below) - compilation switches for support back to stable-7 nvd(4) - BIO_DELETE and BIO_FLUSH (if supported by controller) - proper BIO_ORDERED handling nvmecontrol(8) - devlist: list NVMe controllers and their namespaces - identify: display controller or namespace identify data in human-readable or hex format - perftest: quick and dirty performance test to measure raw performance of NVMe device without userspace/physio/GEOM overhead The following are still work in progress and will be completed over the next 3-6 months in rough priority order: - complete man pages - firmware download and activation - asynchronous error requests - command timeout error handling - controller resets - nvmecontrol(8) log page retrieval This has been primarily tested on amd64, with light testing on i386. I would be happy to provide assistance to anyone interested in porting this to other architectures, but am not currently planning to do this work myself. Big-endian and dmamap sync for command/completion queues are the main areas that would need to be addressed. The nvme(4) driver currently has references to Chatham, which is an Intel-developed prototype board which is not fully spec compliant. These references will all be removed over time. Sponsored by: Intel Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>
|