344033 |
12-Feb-2019 |
mav |
MFC r343562, r343563: Reimplement BIO_ORDERED handling in nvd(4).
This fixes BIO_ORDERED semantics while also improving performance by:
- sleeping before a BIO_ORDERED bio as well, as defined, not only after it;
- not queueing a BIO_ORDERED bio to the taskqueue if no other bios are running;
- waking up the sleeping taskqueue explicitly rather than relying on polling.
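A minimal single-threaded sketch of the ordering rule above. It assumes BIO_ORDERED means "start only after every earlier request has completed, and hold back every later request until this one completes". The types and functions (`struct req`, `struct disk_state`, `submit()`, `complete()`) are hypothetical stand-ins, not nvd(4) code; the real driver works on `struct bio`, must handle concurrency, and defers to a taskqueue where this sketch keeps a plain FIFO.

```c
#include <stdbool.h>
#include <stddef.h>

#define LOG_MAX 64

/* Hypothetical request type standing in for struct bio. */
struct req {
	int id;
	bool ordered;          /* stands in for the BIO_ORDERED flag */
	struct req *next;
};

/* Hypothetical per-disk state. */
struct disk_state {
	int cur_ops;           /* dispatched but not yet completed */
	bool ordered_running;  /* an ordered request is in flight */
	struct req *head;      /* deferred requests, FIFO */
	struct req *tail;
	int log[LOG_MAX];      /* ids in dispatch order, for illustration */
	int nlog;
};

static void dispatch(struct disk_state *d, struct req *r)
{
	d->cur_ops++;
	if (r->ordered)
		d->ordered_running = true;
	if (d->nlog < LOG_MAX)
		d->log[d->nlog++] = r->id;
}

/* Nothing may start while an ordered request runs; an ordered
 * request may start only when nothing else is in flight. */
static bool can_start(const struct disk_state *d, const struct req *r)
{
	if (d->ordered_running)
		return false;
	return !r->ordered || d->cur_ops == 0;
}

void submit(struct disk_state *d, struct req *r)
{
	r->next = NULL;
	/* Fast path: nothing deferred and the request may start now,
	 * including an ordered request on an idle disk. */
	if (d->head == NULL && can_start(d, r)) {
		dispatch(d, r);
		return;
	}
	if (d->tail != NULL)
		d->tail->next = r;
	else
		d->head = r;
	d->tail = r;
}

void complete(struct disk_state *d, const struct req *r)
{
	d->cur_ops--;
	if (r->ordered)
		d->ordered_running = false;
	/* Drain deferred requests in FIFO order immediately instead of
	 * waiting for a periodic poll; stop at the first that must wait. */
	while (d->head != NULL && can_start(d, d->head)) {
		struct req *next = d->head;
		d->head = next->next;
		if (d->head == NULL)
			d->tail = NULL;
		dispatch(d, next);
	}
}
```

The fast path in `submit()` corresponds to not queueing an ordered request when nothing else is running, and the drain loop in `complete()` corresponds to waking deferred work explicitly rather than waiting for a poll.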
On a Samsung SSD 970 PRO this reduces sync write latency, measured with `diskinfo -wS`, from ~2ms to ~1.1ms, since the driver no longer sleeps without reason until the next HZ tick.
On the same device, a ZFS pool with 8 ZVOLs synchronously writing 4KB blocks shows ~950 IOPS instead of ~750 IOPS before. I suspect ZFS does not need BIO_ORDERED on BIO_FLUSH at all, but that will be the next question. |
343447 |
25-Jan-2019 |
mav |
MFC r342557, r342559: Reimplement nvd(4) detach handling.
The previous code typically crashed on NVMe device unplug, or even on a clean detach, while some I/Os were still in flight. To fix this, the new code calls disk_gone() and waits for confirmation that all references are gone before calling disk_destroy(), freeing other resources, and allowing the controller to detach.
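The teardown order described above (stop new I/O, wait for the last reference, only then destroy) can be sketched with a plain reference count. This is a single-threaded illustration with hypothetical names (`struct vdisk`, `vdisk_gone()`, `vdisk_release()`); in the kernel the final step waits for GEOM's confirmation that all references are gone, and any counter would need proper locking.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a disk that in-flight I/O holds references on. */
struct vdisk {
	int refs;          /* in-flight I/Os and other holders */
	bool gone;         /* teardown started: no new references */
	bool destroyed;    /* resources released */
};

/* Take a reference for a new I/O; refused once teardown has begun. */
bool vdisk_acquire(struct vdisk *d)
{
	if (d->gone)
		return false;
	d->refs++;
	return true;
}

static void vdisk_destroy(struct vdisk *d)
{
	/* Only reached when nothing can still touch the disk. */
	d->destroyed = true;
}

/* Drop a reference; the last one after teardown triggers destruction. */
void vdisk_release(struct vdisk *d)
{
	if (--d->refs == 0 && d->gone)
		vdisk_destroy(d);
}

/* Begin detach: refuse new I/O, destroy now or when the last reference drops. */
void vdisk_gone(struct vdisk *d)
{
	d->gone = true;
	if (d->refs == 0)
		vdisk_destroy(d);
}
```

Destroying immediately in `vdisk_gone()` while `refs` is non-zero would reproduce the class of crash the commit describes: completions arriving for a disk whose resources were already freed.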
While there, fix disk list locking and reimplement unit number assignment. |
295022 |
28-Jan-2016 |
jimharris |
nvd: add hw.nvd.delete_max tunable
The NVMe specification does not define a maximum or optimal delete size, so technically the max delete size is min(full size of namespace, 2^32 - 1 LBAs). A single delete operation for a multi-TB NVMe namespace, though, may take much longer to complete than the nvme(4) I/O timeout period. So choose a sensible default here that is still large enough to minimize the overall number of delete operations.
This also fixes a possible uint32_t overflow in the initial TRIM operation issued by zpool create on NVMe namespaces with >4G LBAs.
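The arithmetic behind the cap can be sketched as follows, assuming the tunable is a byte count and that each delete command carries a 32-bit LBA count (as an NVMe Dataset Management range does). `next_delete_lbas()` and `count_delete_cmds()` are hypothetical helpers, not nvd(4) functions.

```c
#include <stdint.h>

/* Hypothetical helper: LBAs the next delete command may cover.
 * delete_max is assumed to be in bytes. */
uint32_t next_delete_lbas(uint64_t remaining_lbas, uint64_t delete_max,
    uint32_t sector_size)
{
	uint64_t cap = delete_max / sector_size;   /* tunable limit in LBAs */

	if (cap == 0)
		cap = 1;                           /* always make progress */
	if (cap > UINT32_MAX)
		cap = UINT32_MAX;                  /* 32-bit LBA count field */
	return (uint32_t)(remaining_lbas < cap ? remaining_lbas : cap);
}

/* Hypothetical helper: number of delete commands needed for nlbas LBAs. */
uint64_t count_delete_cmds(uint64_t nlbas, uint64_t delete_max,
    uint32_t sector_size)
{
	uint64_t n = 0;

	while (nlbas > 0) {
		nlbas -= next_delete_lbas(nlbas, delete_max, sector_size);
		n++;
	}
	return n;
}
```

With 512-byte sectors and a 1 GiB cap, a 4 TiB namespace (2^33 LBAs) is trimmed in 4096 commands of 2^21 LBAs each, whereas casting the full 2^33 LBA count straight to `uint32_t` would yield 0, the kind of overflow mentioned above.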
MFC after: 3 days
Sponsored by: Intel
|
292074 |
11-Dec-2015 |
smh |
Limit stripesize reported from nvd(4) to 4K
Intel NVMe controllers have a slow path for I/Os that span a 128KB stripe boundary, but ZFS limits ashift, which is derived from d_stripesize, to 13 (8KB), so we limit the stripesize reported to geom(8) to 4KB.
This may cause a small number of additional I/Os to require splitting in nvme(4); however, the NVMe I/O path is very efficient, so these additional I/Os will make very little (if any) difference in performance or CPU utilisation.
This can be controlled by the new sysctl kern.nvme.max_optimal_sectorsize.
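A sketch of the calculations involved, assuming power-of-two sizes: clamping the reported stripe size to a configurable limit, deriving an ashift from it, and testing whether an I/O crosses a stripe boundary (the case that needs splitting). The helper names are hypothetical; only the sysctl name above comes from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: clamp the controller's preferred stripe to a configured limit. */
uint32_t reported_stripesize(uint32_t ctrlr_stripe, uint32_t max_optimal)
{
	return ctrlr_stripe < max_optimal ? ctrlr_stripe : max_optimal;
}

/* True if [offset, offset + len) crosses a multiple of stripe. */
bool crosses_stripe(uint64_t offset, uint64_t len, uint64_t stripe)
{
	if (len == 0 || stripe == 0)
		return false;
	return offset / stripe != (offset + len - 1) / stripe;
}

/* ashift = log2(size) for a power-of-two size no larger than 2^31. */
unsigned ashift_of(uint32_t size)
{
	unsigned s = 0;

	while ((1U << s) < size)
		s++;
	return s;
}
```

With a 4KB limit, `ashift_of()` gives 12, within the ZFS limit of 13, whereas reporting the full 128KB stripe would give 17.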
MFC after: 1 week
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4446
|
248767 |
26-Mar-2013 |
jimharris |
Add the ability to internally mark a controller as failed, if it is unable to start or reset. Also add a notifier for NVMe consumers for controller fail conditions and plumb this notifier for nvd(4) to destroy the associated GEOM disks when a failure occurs.
This requires a bit of work to cover the races when a consumer is sending I/O requests to a controller that is transitioning to the failed state. To help cover this condition, add a task to defer completion of I/Os submitted to a failed controller, so that the consumer will still always receive its completions in a different context than the submission.
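The deferred-completion idea can be sketched as follows: on a failed controller, submission never invokes the consumer's callback inline; it queues the request, and a separate task delivers the completions later. A consumer may, for example, hold a lock while submitting, so an inline callback could re-enter it. All names here (`struct xreq`, `struct xctrlr`, `xsubmit()`, `xfail_task()`) are hypothetical, EIO is a placeholder error, and the sketch is single-threaded.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request with a completion callback. */
struct xreq {
	void (*done)(struct xreq *, int error);
	int error;
	struct xreq *next;
	int completions;          /* illustration: times done() ran */
};

/* Hypothetical controller state. */
struct xctrlr {
	bool failed;
	struct xreq *fail_head;   /* completions deferred to the task */
	struct xreq *fail_tail;
};

/* Example callback that just counts invocations. */
void count_done(struct xreq *r, int error)
{
	(void)error;
	r->completions++;
}

/* Submit: on a failed controller, queue the completion instead of
 * calling done() in the submitter's context. */
void xsubmit(struct xctrlr *c, struct xreq *r)
{
	if (!c->failed)
		return;               /* hardware submission would go here */
	r->error = EIO;           /* placeholder error code */
	r->next = NULL;
	if (c->fail_tail != NULL)
		c->fail_tail->next = r;
	else
		c->fail_head = r;
	c->fail_tail = r;
	/* A kernel implementation would schedule the task here. */
}

/* Body of the deferred task: runs later, in a different context. */
void xfail_task(struct xctrlr *c)
{
	struct xreq *r = c->fail_head;

	c->fail_head = c->fail_tail = NULL;
	while (r != NULL) {
		struct xreq *next = r->next;   /* done() may reuse r */
		r->done(r, r->error);
		r = next;
	}
}
```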
Sponsored by: Intel
Reviewed by: carl
|
248756 |
26-Mar-2013 |
jimharris |
Create struct nvme_status.
NVMe error log entries include status, so breaking this out into its own data structure allows it to be included in both the nvme_completion data structure as well as error log entry data structures.
While here, expose nvme_completion_is_error(), and change all of the places that were explicitly looking at sc/sct bits to use this macro instead.
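For illustration, the status word can be decoded with shifts. This sketch assumes the 16-bit status occupies the upper half of completion dword 3 with the NVMe-defined layout noted in the comments. `struct status_sketch` and its helpers are hypothetical names; the check in `completion_is_error()` mirrors the sc/sct test the message describes, but it is not the FreeBSD definition itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper 16 bits of completion dword 3 (NVMe layout):
 * bit 0 phase tag (P), bits 1-8 status code (SC),
 * bits 9-11 status code type (SCT), bit 14 more (M),
 * bit 15 do not retry (DNR). Shifts are used instead of
 * bit-fields because bit-field order is implementation-defined. */
struct status_sketch {
	uint16_t raw;
};

static inline unsigned status_sc(struct status_sketch s)
{
	return (s.raw >> 1) & 0xff;
}

static inline unsigned status_sct(struct status_sketch s)
{
	return (s.raw >> 9) & 0x7;
}

static inline bool status_dnr(struct status_sketch s)
{
	return (s.raw >> 15) & 0x1;
}

/* Success is SCT 0 (generic) with SC 0; the phase bit is ignored. */
static inline bool completion_is_error(struct status_sketch s)
{
	return status_sc(s) != 0 || status_sct(s) != 0;
}
```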
Sponsored by: Intel
Reviewed by: carl
|
240616 |
17-Sep-2012 |
jimharris |
This is the first of several commits which will add NVM Express (NVMe) support to FreeBSD. A full description of the overall functionality being added is below. nvmexpress.org defines NVM Express as "an optimized register interface, command set and feature set for PCI Express (PCIe)-based Solid-State Drives (SSDs)."
This commit adds nvme(4) and nvd(4) driver source code and Makefiles to the tree.
Full NVMe functionality description: Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe) device support.
There will continue to be ongoing work on NVM Express support, but there is more than enough to allow for evaluation of pre-production NVM Express devices as well as soliciting feedback. Questions and feedback are welcome.
nvme(4) implements NVMe hardware abstraction and is a provider of NVMe namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN. nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks. nvmecontrol(8) is used for NVMe configuration and management.
The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts; see below)
- compilation switches for support back to stable-7
nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling
nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in human-readable or hex format
- perftest: quick and dirty performance test to measure raw performance of NVMe device without userspace/physio/GEOM overhead
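One plausible way to map CPUs onto I/O queues when fewer queues than CPUs are configured is to give each queue a contiguous block of CPUs. This is an assumption for illustration only; the message above does not say how nvme(4) performs the mapping, and `ioq_for_cpu()` is a hypothetical helper. In a kernel, the `cpu` argument would be the current CPU.

```c
/* Hypothetical: map a CPU to one of num_queues I/O queue pairs.
 * One queue per CPU gives the identity mapping; with fewer queues,
 * neighbouring CPUs share a queue. The result is always < num_queues. */
unsigned ioq_for_cpu(unsigned cpu, unsigned ncpus, unsigned num_queues)
{
	unsigned per_q;

	if (ncpus == 0)
		return 0;
	if (num_queues == 0)
		num_queues = 1;
	/* ceil(ncpus / num_queues) CPUs per queue */
	per_q = (ncpus + num_queues - 1) / num_queues;
	return cpu / per_q;
}
```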
The following are still work in progress and will be completed over the next 3-6 months, in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval
This has been primarily tested on amd64, with light testing on i386. I would be happy to provide assistance to anyone interested in porting this to other architectures, but am not currently planning to do this work myself. Big-endian and dmamap sync for command/completion queues are the main areas that would need to be addressed.
The nvme(4) driver currently has references to Chatham, which is an Intel-developed prototype board which is not fully spec compliant. These references will all be removed over time.
Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>
|