History log of /freebsd-current/sys/dev/nvme/nvme_sysctl.c
Revision Date Author Comments
# 8d6c0743 06-Nov-2023 Alexander Motin <mav@FreeBSD.org>

nvme: Introduce longer timeouts for admin queue

KIOXIA CD8 SSDs routinely take ~25 seconds to delete a non-empty
namespace. In some cases, such as hot-plug, it takes longer, triggering
a timeout and controller reset after just 30 seconds. Linux has for many
years had a separate 60-second timeout for the admin queue. This patch
does the same, which also keeps the two consistent.

Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454
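
A minimal sketch of the idea above, assuming the driver simply picks a
timeout by queue type. The 30 and 60 second values come from the commit
text; every identifier here is hypothetical and not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the commit text; the names are hypothetical. */
#define SKETCH_IO_TIMEOUT_SEC		30u
#define SKETCH_ADMIN_TIMEOUT_SEC	60u

/* Return the command timeout, in seconds, for the given queue type. */
static uint32_t
sketch_timeout_for_queue(bool is_admin_queue)
{
	return (is_admin_queue ? SKETCH_ADMIN_TIMEOUT_SEC :
	    SKETCH_IO_TIMEOUT_SEC);
}
```

With this split, a ~25 second namespace delete on the admin queue has
headroom, while I/O commands keep the shorter limit.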


# 8052b01e 25-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Add exclusion for ISR

Add a basically uncontended spinlock that we take out while the ISR is
running. This has two effects: First, when we get a timeout, we can
safely call the nvme_qpair_process_completions w/o racing any ISRs.
Second, we can use it to ensure that we don't reset the card while
the ISRs are active (right now we just sleep and hope for the best,
which usually is fine, but not always).

Sponsored by: Netflix
MFC After: 2 weeks
Reviewed by: chuck, gallatin
Differential Revision: https://reviews.freebsd.org/D41452
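
The locking pattern described above can be sketched in userland C, with a
C11 atomic_flag standing in for the kernel spinlock. This is an
illustration under that assumption, not the driver's code: all names are
hypothetical and the completion routine is a stub.

```c
#include <stdatomic.h>

struct isr_qpair {
	atomic_flag	isr_lock;	/* held while completions are processed */
	unsigned	pending;	/* completions waiting in the queue */
	unsigned	processed;	/* completions handled so far */
};

static void
isr_lock_acquire(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin; expected to be nearly uncontended */
}

static void
isr_lock_release(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Stub for the completion routine; the caller must hold isr_lock. */
static void
isr_process_completions(struct isr_qpair *q)
{
	q->processed += q->pending;
	q->pending = 0;
}

/* Interrupt path: hold the lock for the duration of the handler. */
static void
isr_handler(struct isr_qpair *q)
{
	isr_lock_acquire(&q->isr_lock);
	isr_process_completions(q);
	isr_lock_release(&q->isr_lock);
}

/* Timeout path: the same lock keeps it from racing a running ISR. */
static void
isr_timeout(struct isr_qpair *q)
{
	isr_lock_acquire(&q->isr_lock);
	isr_process_completions(q);
	isr_lock_release(&q->isr_lock);
}
```

A reset path could take the same lock before touching the controller,
which is the second effect the commit message describes.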


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 7be0b068 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate command printing routine

Both nvme_dump_command and nvme_qpair_print_command print nvme
commands. The latter is better. Recode the one call to
nvme_dump_command to use nvme_qpair_print_command and delete the
former. No sense having two nearly identical routines. A future commit
will convert to sbuf.

Sponsored by: Netflix
Reviewed by: chuck, mav, jhb
Differential Revision: https://reviews.freebsd.org/D41309


# 6f76d493 07-Aug-2023 Warner Losh <imp@FreeBSD.org>

nvme: Remove duplicate completion printing routine

Both nvme_dump_completion and nvme_qpair_print_completion print
completions. The latter is better. Recode the two instances of
nvme_dump_completion to use nvme_qpair_print_completion and delete the
former. No sense having two nearly identical routines. A future commit
will convert this to sbuf.

Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D41308


# bdc81eed 12-Jun-2023 Warner Losh <imp@FreeBSD.org>

nvme: Switch to nda by default

We already run nda by default on all the !x86 architectures. Switch the
default to nda. nda creates nvd compatibility links by default, so this
should be a nop. If this causes problems for your application, set
hw.nvme.use_nvd=1 in your loader.conf.

Sponsored by: Netflix


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 161fcf79 29-Mar-2022 Warner Losh <imp@FreeBSD.org>

nvme: Publish the drive's capabilities

Add cap_lo and cap_hi sysctls to each nvme drive. This publishes the raw
capabilities of the drive. Until now, these could only be discovered with
bootverbose.

Sponsored by: Netflix


# 5f8ccf65 30-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

nvme(4): Correct a typo in a sysctl description

- s/printting/printing/

MFC after: 3 days


# 587aa255 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: count number of ignored interrupts

Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.

Sponsored by: Netflix
Reviewed by: mav, chuck
Differential Revision: https://reviews.freebsd.org/D32212
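
As a rough sketch of the counter described above: the completion entry
point bails out early and bumps a counter whenever the qpair is not in the
RECOVERY_NONE state. The struct layout, the second enum value, and the
function name are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

enum ign_recovery_state {
	IGN_RECOVERY_NONE,	/* normal operation */
	IGN_RECOVERY_OTHER	/* hypothetical: any recovery in progress */
};

struct ign_qpair {
	enum ign_recovery_state	recovery_state;
	uint64_t		num_ignored;	/* requests skipped */
};

/* Returns true if completions were processed, false if ignored. */
static bool
ign_process_completions(struct ign_qpair *q)
{
	if (q->recovery_state != IGN_RECOVERY_NONE) {
		q->num_ignored++;
		return (false);
	}
	/* ... normal completion processing would go here ... */
	return (true);
}
```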


# 7d5eebe0 28-Sep-2021 Warner Losh <imp@FreeBSD.org>

nvme: Add sanity check for phase on startup.

The proper phase for the qpair right after reset in the first interrupt
is 1. On that first interrupt, make sure that we're not still in phase 0.
This is an illegal state in which to be processing interrupts and
indicates that we've failed to properly protect against a race between
initializing our state and processing interrupts. Modify the stat
resetting code so it resets the number of interrupts to 1 instead of 0 so
we don't trigger a false positive panic.

Sponsored by: Netflix
Reviewed by: cperciva, mav (prior version)
Differential Revision: https://reviews.freebsd.org/D32211
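
The check and the stat-reset change fit together as sketched below. This
assumes the "first interrupt" is detected by an interrupt counter of zero;
the names are hypothetical, and the sketch returns a flag where the driver
would panic.

```c
#include <stdbool.h>
#include <stdint.h>

struct phase_qpair {
	uint64_t	num_intr_handler_calls;
	uint8_t		phase;	/* expected completion phase bit */
};

/* Controller reset: the first interrupt afterwards should see phase 1. */
static void
phase_qpair_reset(struct phase_qpair *q)
{
	q->num_intr_handler_calls = 0;
	q->phase = 1;
}

/*
 * Statistics reset: use 1, not 0, so that a later interrupt taken while
 * the phase bit happens to be 0 is not mistaken for the first one.
 */
static void
phase_qpair_reset_stats(struct phase_qpair *q)
{
	q->num_intr_handler_calls = 1;
}

/* True when the illegal state is seen: first interrupt but phase 0. */
static bool
phase_first_interrupt_is_bad(const struct phase_qpair *q)
{
	return (q->num_intr_handler_calls == 0 && q->phase == 0);
}
```

The phase bit flips each time the completion queue wraps, so phase 0 is
legitimate later in the qpair's life; only the combination with a zero
interrupt count is suspicious.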


# b776de67 10-Aug-2021 Alexander Motin <mav@FreeBSD.org>

Mark some sysctls as CTLFLAG_MPSAFE.

MFC after: 2 weeks


# 0fc1d208 23-Oct-2020 Warner Losh <imp@FreeBSD.org>

nvme: Remove compat code for older kernels

Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined
in rev 217616. All supported branches have it, so remove its compat
definition as OBE.


# d87b31e1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

nvme: clean up empty lines in .c and .h files


# 4053f8ac 02-May-2020 David Bright <dab@FreeBSD.org>

Fix various Coverity-detected errors in nvme driver

This fixes several Coverity-detected errors in the nvme driver.

CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975,
1403980

Reviewed by: imp@, vangyzen@
MFC after: 5 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D24532


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 1eab19cb 23-Sep-2019 Alexander Motin <mav@FreeBSD.org>

Make nvme(4) driver some more NUMA aware.

- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock contention.
- Remove fixed multiplier of CPUs per queue pair, spread them evenly.
This allows using more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.

MFC after: 1 month
Sponsored by: iXsystems, Inc.
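
One way to spread CPUs evenly over queue pairs without a fixed multiplier
is a proportional integer mapping, sketched below. This illustrates the
idea only; the function name and the formula are assumptions, not a copy
of the driver's code.

```c
/*
 * Map a CPU index to a queue pair index so that CPUs are divided as
 * evenly as integer arithmetic allows. Requires cpu < ncpus and
 * num_io_queues <= ncpus.
 */
static unsigned
numa_qpair_for_cpu(unsigned cpu, unsigned num_io_queues, unsigned ncpus)
{
	return (cpu * num_io_queues / ncpus);
}
```

With 8 CPUs and 3 queue pairs this assigns CPUs 0-2 to pair 0, CPUs 3-5
to pair 1, and CPUs 6-7 to pair 2.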


# 5e83c2ff 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of commands that exhaust their retry limit.

While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps gauge the magnitude of the problem.


# c37fc318 19-Jul-2019 Warner Losh <imp@FreeBSD.org>

Keep track of the number of retried commands.

Retried commands can indicate a performance degradation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands and interrupts.
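
The retry counter from this entry and the exhausted-retry counter from the
entry above could be maintained together, roughly as follows. The struct,
the function, and the retry-limit parameter are hypothetical names chosen
for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

struct retry_stats {
	uint64_t	num_retries;	/* commands resubmitted */
	uint64_t	num_failures;	/* commands that ran out of retries */
};

/*
 * Called when a command completes with a retriable error. Returns true
 * if the command should be resubmitted, false once the limit is hit.
 */
static bool
retry_should_resubmit(struct retry_stats *s, unsigned *retries_so_far,
    unsigned retry_limit)
{
	if (*retries_so_far < retry_limit) {
		(*retries_so_far)++;
		s->num_retries++;
		return (true);
	}
	s->num_failures++;
	return (false);
}
```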


# 1071b50a 18-Jul-2019 Warner Losh <imp@FreeBSD.org>

Use sysctl + CTLFLAG_RWTUN for hw.nvme.verbose_cmd_dump.

Also convert it to a bool. While the rest of the driver isn't yet bool clean,
this will help.

Reviewed by: cem@
Differential Revision: https://reviews.freebsd.org/D20988


# 718cf2cc 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/dev: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# 850564b9 28-Aug-2017 Warner Losh <imp@FreeBSD.org>

Add new compile-time option NVME_USE_NVD that sets the default value
of the runtime hw.nvme.use_nvd tunable. We still default to nvd unless
otherwise requested.

Sponsored by: Netflix


# 8a5d94f9 03-Aug-2017 Warner Losh <imp@FreeBSD.org>

Make nvd vs nda choice boot-time rather than build-time

Introduce hw.nvme.use_nvd tunable. This tunable allows both nvd and
nda to be installed in the kernel, while allowing only one of them to
create devices. This is an all-or-nothing setting, and you can't
change it after boot-time. However, it will allow easier A/B testing.

Differential Revision: https://reviews.freebsd.org/D11825


# ee7f4d81 10-Mar-2016 Alexander Motin <mav@FreeBSD.org>

Revert r292074 (by smh): Limit stripesize reported from nvd(4) to 4K

I believe that this patch handled the problem from the wrong side.
Instead of making ZFS properly handle large stripe sizes, it made an
unrelated driver lie about its reported parameters to work around that.

Alternative solution for this problem from ZFS side was committed at
r296615.

Discussed with: smh


# 50dea2da 07-Jan-2016 Jim Harris <jimharris@FreeBSD.org>

nvme: add hw.nvme.min_cpus_per_ioq tunable

Due to FreeBSD system-wide limits on number of MSI-X vectors
(https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321),
it may be desirable to allocate fewer than the maximum number
of vectors for an NVMe device, in order to save vectors for
other devices (usually Ethernet) that can take better
advantage of them and may be probed after NVMe.

This tunable is expressed in terms of minimum number of CPUs
per I/O queue instead of max number of queues per controller,
to allow for a more even distribution of CPUs per queue. This
avoids cases where some number of CPUs have a dedicated queue,
but other CPUs need to share queues. Ideally the PR referenced
above will eventually be fixed and the mechanism implemented
here becomes obsolete anyways.

While here, fix a bug in the CPUs per I/O queue calculation to
properly account for the admin queue's MSI-X vector.

Reviewed by: gallatin
MFC after: 3 days
Sponsored by: Intel
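
A sketch of why the tunable is phrased as CPUs per queue rather than a
queue count: first work out how many CPUs each queue must serve (reserving
one MSI-X vector for the admin queue), raise that to the tunable's minimum,
then derive the queue count from it. The function names are hypothetical
and the arithmetic is an illustration, not the driver's implementation.

```c
/* CPUs served by each I/O queue. Requires ncpus >= 1. */
static unsigned
ioq_cpus_per_queue(unsigned ncpus, unsigned min_cpus_per_ioq,
    unsigned msix_vectors)
{
	/* One vector is reserved for the admin queue. */
	unsigned io_vectors = msix_vectors > 1 ? msix_vectors - 1 : 1;
	/* Round up so that every CPU is covered. */
	unsigned per_queue = (ncpus + io_vectors - 1) / io_vectors;

	if (per_queue < min_cpus_per_ioq)
		per_queue = min_cpus_per_ioq;
	return (per_queue);
}

/* Number of I/O queues needed at the given CPUs-per-queue ratio. */
static unsigned
ioq_num_queues(unsigned ncpus, unsigned cpus_per_queue)
{
	return ((ncpus + cpus_per_queue - 1) / cpus_per_queue);
}
```

For example, 12 CPUs with 9 vectors yields 2 CPUs per queue and 6 queues,
rather than 8 queues of which some are dedicated and others shared.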


# fdf16a68 10-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Limit stripesize reported from nvd(4) to 4K

Intel NVMe controllers have a slow path for I/Os that span a 128KB
stripe boundary, but ZFS limits ashift, which is derived from
d_stripesize, to 13 (8KB), so we limit the stripesize reported to
geom(8) to 4KB.

This may result in a small number of additional I/Os requiring
splitting in nvme(4); however, the NVMe I/O path is very efficient, so
these additional I/Os will cause very minimal (if any) difference in
performance or CPU utilisation.

This can be controlled by the new sysctl kern.nvme.max_optimal_sectorsize.

MFC after: 1 week
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4446


# e9efbc13 09-Jul-2013 Jim Harris <jimharris@FreeBSD.org>

Update copyright dates.

MFC after: 3 days


# be34f216 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Remove the is_started flag from struct nvme_controller.

This flag was originally added to communicate to the sysctl code
which oids should be built, but there are easier ways to do this. This
needs to be cleaned up prior to adding new controller states - for example,
controller failure.

Sponsored by: Intel
Reviewed by: carl


# 94143332 26-Mar-2013 Jim Harris <jimharris@FreeBSD.org>

Add a tunable for the I/O timeout interval. Default is still 30 seconds,
but can be adjusted between a min/max of 5 and 120 seconds.

Sponsored by: Intel
Reviewed by: carl
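
The range check on such a tunable might look like the sketch below. The
5, 30, and 120 second values come from the commit text; rejecting
out-of-range values with EINVAL (rather than clamping them) is an
assumption, and the names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define TMO_MIN_SEC	5u
#define TMO_MAX_SEC	120u
#define TMO_DEFAULT_SEC	30u

/*
 * Store a new timeout if it is within [TMO_MIN_SEC, TMO_MAX_SEC].
 * Returns 0 on success or EINVAL, leaving the old value untouched.
 */
static int
tmo_set(uint32_t *timeout_sec, uint32_t requested)
{
	if (requested < TMO_MIN_SEC || requested > TMO_MAX_SEC)
		return (EINVAL);
	*timeout_sec = requested;
	return (0);
}
```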


# 21b6da58 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Preallocate a limited number of nvme_tracker objects per qpair, rather
than dynamically creating them at runtime.

Sponsored by: Intel


# f2b19f67 17-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Merge struct nvme_prp_list into struct nvme_tracker.

This simplifies the driver significantly where it is constructing
commands to be submitted to hardware. By reducing the number of
PRPs (NVMe parlance for SGE) from 128 to 32, it ensures we do not
allocate too much memory for more common smaller I/O sizes, while
still supporting up to 128KB I/O sizes.

This also paves the way for pre-allocation of nvme_tracker objects
for each queue which will simplify the I/O path even further.

Sponsored by: Intel
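
The 128KB figure follows from simple arithmetic, sketched here assuming a
4KB memory page per PRP entry and a page-aligned buffer. Both assumptions
and all names are introduced for this illustration.

```c
#include <stddef.h>

#define PRP_PAGE_SIZE	4096u	/* assumed page size per PRP entry */
#define PRP_MAX_ENTRIES	32u	/* per-tracker limit from the commit */

/* Largest page-aligned transfer that a given number of PRPs can map. */
static size_t
prp_max_io_bytes(size_t entries)
{
	return (entries * PRP_PAGE_SIZE);
}

/* PRP entries needed for a page-aligned transfer of the given size. */
static size_t
prp_entries_needed(size_t bytes)
{
	return ((bytes + PRP_PAGE_SIZE - 1) / PRP_PAGE_SIZE);
}
```

So 32 entries cover 32 x 4KB = 128KB, while the previous 128 entries would
have covered 512KB at four times the per-tracker memory.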


# 6568ebfc 10-Oct-2012 Jim Harris <jimharris@FreeBSD.org>

Count number of times each queue pair's interrupt handler is invoked.

Also add sysctls to query and reset each queue pair's stats, including
the new count added here.

Sponsored by: Intel


# bb0ec6b3 17-Sep-2012 Jim Harris <jimharris@FreeBSD.org>

This is the first of several commits which will add NVM Express (NVMe)
support to FreeBSD. A full description of the overall functionality
being added is below. nvmexpress.org defines NVM Express as "an optimized
register interface, command set and feature set for PCI Express (PCIe)-based
Solid-State Drives (SSDs)."

This commit adds nvme(4) and nvd(4) driver source code and Makefiles
to the tree.

Full NVMe functionality description:
Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe)
device support.

There will continue to be ongoing work on NVM Express support, but there
is more than enough to allow for evaluation of pre-production NVM Express
devices as well as soliciting feedback. Questions and feedback are welcome.

nvme(4) implements NVMe hardware abstraction and is a provider of NVMe
namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN.
nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks.
nvmecontrol(8) is used for NVMe configuration and management.

The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue
dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts; see below)
- compilation switches for support back to stable-7

nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling

nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in
human-readable or hex format
- perftest: quick and dirty performance test to measure raw
performance of NVMe device without userspace/physio/GEOM
overhead

The following are still work in progress and will be completed over the
next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval

This has been primarily tested on amd64, with light testing on i386. I
would be happy to provide assistance to anyone interested in porting
this to other architectures, but am not currently planning to do this
work myself. Big-endian and dmamap sync for command/completion queues
are the main areas that would need to be addressed.

The nvme(4) driver currently has references to Chatham, which is an
Intel-developed prototype board which is not fully spec compliant.
These references will all be removed over time.

Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>