History log of /freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
359554 02-Apr-2020 mav

MFC r359112: MFOpenZFS: make zil max block size tunable

We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
openzfs/zfs@b8738257c2607c73c731ce8e0fd73282b266d6ef

358313 25-Feb-2020 mav

MFC r349381: Avoid extra taskq_dispatch() calls by DMU.

DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

353759 19-Oct-2019 avg

MFC r353037: ZFS: add bookmark renaming

347047 03-May-2019 mav

MFC r346760: Fix minor mismerges.

No functional change.

346686 25-Apr-2019 mav

MFC r345200: MFV r336930: 9284 arc_reclaim_thread has 2 jobs

`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.

The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.

The fix is to use seperate threads to:

1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`

2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.

illumos/illumos-gate@de753e34f9c399037936e8bc547d823bba9d4b0d

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>

346676 25-Apr-2019 mav

MFC r337594 (by mmacy):
ZFS/MFV: Use cached feature info in spa_add_feature_stats()

commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date: Thu Feb 26 12:24:11 2015 -0800

Use cached feature info in spa_add_feature_stats()

Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082

346131 11-Apr-2019 mav

MFC r344936: MFV/ZoL: Disable LBA weighting on files and SSDs

The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f0853946f8150481ca22602d1334dfe

To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.

339158 03-Oct-2018 mav

MFC r337567 (by mmacy):
Performance optimization of AVL tree comparator functions

MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: Sat Aug 27 20:12:53 2016 +0200

perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old 6.85789 s
new 2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals

perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033


/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_iter.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_reftree.c
zfs_context.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/unique.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_fuid.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/avl.h
339147 03-Oct-2018 mav

MFC r337229: Reduce taskq and context-switch cost of zio pipe

When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching. We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736

zfsonlinux/zfs@62840030a7dceaee013ddbcc1eebcfc7922edf7c

339141 03-Oct-2018 mav

MFC r337213: MFV r337212:
9465 ARC check for 'anon_size > arc_c/2' can stall the system

illumos/illumos-gate@abe1fd01ce5a83718c5a840daeab4abdaec1c104

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Don Brady <don.brady@delphix.com>

339140 03-Oct-2018 mav

MFC r337211: MFV r337210: 9577 remove zfs_dbuf_evict_key tsd

The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339131 03-Oct-2018 mav

MFC r337191:
MFV r337190: 9486 reduce memory used by device removal on fragmented pools

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

illumos/illumos-gate@cfd63e1b1bcf7ba4bf72f55ddbd87ce008d2986d

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339128 03-Oct-2018 mav

MFC r337181: 9539 Make zvol operations use _by_dnode routines

Continues what was started in 7801 add more by-dnode routines by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.

illumos/illumos-gate@8dfe5547fbf0979fc1065a8b6fddc1e940a7cf4f

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Richard Yao <richard.yao@prophetstor.com>

339126 03-Oct-2018 mav

MFC r337177:
MFV r337175: 9487 Free objects when receiving full stream as clone

All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.

zfsonlinux/zfs@48fbb9ddbf2281911560dfbc2821aa8b74127315
illumos/illumos-gate@7864b8192b8d30471fa2240466d516292e5765b8

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

339125 03-Oct-2018 mav

MFC r337172, MFV r337171:
9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes

Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.

illumos/illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339120 03-Oct-2018 mav

MFC r337169: MFV r337167: 9442 decrease indirect block size of spacemaps

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

illumos/illumos-gate@221813c13b43ef48330b03725e00edee85108cf1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339111 03-Oct-2018 mav

MFC r337007: MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/Makefile.files
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
metaslab_impl.h
spa.h
vdev_impl.h
vdev_initialize.h
zio_priority.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_initialize.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_removal.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
339109 03-Oct-2018 mav

MFC r336959: MFV r336958: 9337 zfs get all is slow due to uncached metadata

This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339108 03-Oct-2018 mav

MFC r336956: MFV r336955: 9236 nuke spa_dbgmsg

We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339106 03-Oct-2018 mav

MFC r336951: MFV r336950: 9290 device removal reduces redundancy of mirrors

Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339105 03-Oct-2018 mav

MFC r336949:
MFV r336948: 9112 Improve allocation performance on high-end systems

On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.

illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>

339104 03-Oct-2018 mav

MFC r336947: MFV r336946: 9238 ZFS Spacemap Encoding V2

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339034 01-Oct-2018 sef

MFC r334844, r336180, r336458

r334844

This originated from ZFS On Linux, as
https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

r336180

Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges. One of the missing changes
included a notification on scrub cancellation.

r336458

Fix a couple of typos in r334844 noticed by Richard Kojedzinszky

Approved by: mav
Sponsored by: iXsystems, Inc


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/taskq.c
/freebsd-11-stable/sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ddt.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/range_tree.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
arc.h
dsl_pool.h
dsl_scan.h
range_tree.h
spa_impl.h
vdev.h
vdev_impl.h
zio.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h
338975 27-Sep-2018 mav

MFC r334810 (by benno), r338205, r338206:
r334810:
Break recursion involving getnewvnode and zfs_rmnode.

When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.

In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.

r338205:
Create separate taskqueue to call zfs_unlinked_drain().

r334810 introduced zfs_unlinked_drain() dispatch to taskqueue on every
deletion of a file with extended attributes. Using system_taskq for that
with its multiple threads in case of multiple files deletion caused all
available CPU threads to uselessly spin on busy locks, completely blocking
the system.

Use of single dedicated taskqueue is the only easy solution I've found,
while in would be great if we could specify that some task should be
executed only once at a time, but never in parallel, while many tasks
could use different threads same time.

r338206:
Add dmu_tx_assign() error handling in zfs_unlinked_drain().

The error handling got lost during r334810, while according to the report
error there may happen in case of dataset being over quota. In such case
just leave the node in the unlinked list to be freed sometimes later.

338484 05-Sep-2018 kib

MFC r338370:
Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.

332785 19-Apr-2018 mav

MFC r332523: 9433 Fix ARC hit rate

When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

332547 16-Apr-2018 mav

MFC r331701: MFV r331695, 331700: 9166 zfs storage pool checkpoint

illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb_il.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool-features.7
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/Makefile.files
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode_sync.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/range_tree.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_checkpoint.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
dmu.h
dsl_dir.h
dsl_pool.h
dsl_synctask.h
metaslab.h
metaslab_impl.h
range_tree.h
spa.h
spa_checkpoint.h
spa_impl.h
space_map.h
uberblock_impl.h
vdev.h
vdev_impl.h
vdev_removal.h
zio.h
zthr.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/uberblock.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_removal.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_synctask.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zthr.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
332543 16-Apr-2018 mav

MFC r331414: Reduce struct aggsum_bucket padding to fit into one cache line.

332540 16-Apr-2018 mav

MFC r331404: MFV r331400:
8484 Implement aggregate sum and use for arc counters

In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

332537 16-Apr-2018 mav

MFC r329802: MFV r329799, r329800:
9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b4f491d5e839644e0912cad1c9e7122

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

332536 16-Apr-2018 mav

MFC r329798: MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a2c5e23f4b970ea101137eadd1470a1

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332530 16-Apr-2018 mav

MFC r329765: MFV r329762: 8961 SPA load/import should tell us why it failed

illumos/illumos-gate@3ee8c80c747c4aa3f83351a6920f30c411236e1b

When we fail to open or import a storage pool, we typically don't get any
additional diagnostic information, just "no pool found" or "can not import".

While there may be no additional user-consumable information, we should at
least make this situation easier to debug/diagnose for developers and support.
For example, we could start by using `zfs_dbgmsg()` to log each thing that we
try when importing, and which things failed. E.g. "tried uberblock of txg X
from label Y of device Z". Also, we could log each of the stages that we go
through in `spa_load_impl()`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332525 16-Apr-2018 mav

MFC r329732: MFV r329502: 7614 zfs device evacuation/removal

illumos/illumos-gate@5cabbc6b49070407fb9610cfe73d4c0e0dea3e77

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_deleg.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_deleg.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/Makefile.files
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/bpobj.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ddt.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_zfetch.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/range_tree.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_reftree.c
bpobj.h
dbuf.h
dmu.h
dnode.h
dsl_dataset.h
dsl_deadlist.h
dsl_deleg.h
dsl_dir.h
dsl_pool.h
dsl_scan.h
metaslab.h
metaslab_impl.h
range_tree.h
spa.h
spa_impl.h
space_map.h
vdev.h
vdev_impl.h
vdev_indirect_births.h
vdev_indirect_mapping.h
vdev_removal.h
zfs_debug.h
zil.h
zio.h
zio_priority.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/txg.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect_births.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_indirect_mapping.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_removal.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_get.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
332524 16-Apr-2018 mav

MFC r307317: MFV r307313:
5120 zfs should allow large block/gzip/raidz boot pool (loader project)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

openzfs/openzfs@c8811bd3e2427dddbac6c05a59cfe117d8fea370

FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.

331611 27-Mar-2018 avg

MFC r330974: MFV r330973: 9164 assert: newds == os->os_dsl_dataset

PR: 225877

331399 22-Mar-2018 mav

MFC r329694: MFV r324198: 8081 Compiler warnings in zdb

illumos/illumos-gate@3f7978d02b206a6ebc5652c91aa9f42da6fbe00c
https://github.com/illumos/illumos-gate/commit/3f7978d02b206a6ebc5652c91aa9f42da6fbe00c

https://www.illumos.org/issues/8081
zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
which uses -WError, the only way to build it is to disable all compiler
warnings. This makes it much harder to detect newly introduced bugs. We should
cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>

331397 23-Mar-2018 mav

MFC r329690: MFV r319737: 6939 add sysevents to zfs core for commands

illumos/illumos-gate@ce1577b04976f1d8bb5f235b6eaaab15b46a3068
https://github.com/illumos/illumos-gate/commit/ce1577b04976f1d8bb5f235b6eaaab15b46a3068

https://www.illumos.org/issues/6939
Originally created https://smartos.org/bugview/OS-4489
sysevents should be fired in the kernel from ZFS whenever a command
is run that is logged in zpool history.
Example output
Terminal 1
root - gz sunos ~ # zfs create zones/foobar
root - gz sunos ~ # zfs set quota=10g zones/foobar
root - gz sunos ~ # zfs destroy zones/foobar
Terminal 2
root - gz sunos ~ # sysevent EC_zfs
nvlist version: 0
date = 2016-04-28T14:50:08.964Z
vendor = SUNW
publisher = zfs
class = EC_zfs
subclass = ESC_ZFS_history_event
pid = 0
data = (embedded nvlist)
nvlist version: 0
pool_name = zones
pool_guid = 0x40c964e8f9a7a694
history_record = (embedded nvlist)
nvlist version: 0
dsname = zones/foobar
dsid = 0x1525
history internal str =
internal_name = create
history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>

331395 22-Mar-2018 mav

MFC r329681: MFV r318941: 7446 zpool create should support efi system partition

illumos/illumos-gate@7855d95b30fd903e3918bad5a29b777e765db821
https://github.com/illumos/illumos-gate/commit/7855d95b30fd903e3918bad5a29b777e765db821

https://www.illumos.org/issues/7446
Since we support whole-disk configuration for boot pool, we also will need
whole disk support with UEFI boot and for this, zpool create should create efi-
system partition.
I have borrowed the idea from oracle solaris, and introducing zpool create -
B switch to provide an way to specify that boot partition should be created.
However, there is still an question, how big should the system partition be.
For time being, I have set default size 256MB (thats minimum size for FAT32
with 4k blocks). To support custom size, the set on creation "bootsize"
property is created and so the custom size can be set as: zpool create B -
o bootsize=34MB rpool c0t0d0
After pool is created, the "bootsize" property is read only. When -B switch is
not used, the bootsize defaults to 0 and is shown in zpool get output with
value ''. Older zfs/zpool implementations are ignoring this property.
https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.

331384 22-Mar-2018 mav

MFC r329625: MFV r307315:
7301 zpool export -f should be able to interrupt file freeing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175

330991 15-Mar-2018 avg

MFC r329363: read-behind / read-ahead support for zfs_getpages()

330990 15-Mar-2018 avg

MFC r329823: another rework of getzfsvfs / getzfsvfs_impl code

330986 15-Mar-2018 avg

MFC r329717: MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue

330237 01-Mar-2018 avg

MFC r329314: MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio

PR: 223803

329486 18-Feb-2018 mav

MFC r328228: MFV r328227: 8909 8585 can cause a use-after-free kernel panic

illumos/illumos-gate@94ddd0900a8838f62bba15e270649a42f4ef9f81

https://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only

> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race is present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

329485 18-Feb-2018 mav

MFC r328226: MFV r328225:
8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock"

illumos/illumos-gate@cf07d3da9915c0d22da8f59e991639f819463cef

https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".

Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

329484 18-Feb-2018 mav

MFC r328224: MFV r328220: 8677 Open-Context Channel Programs

illumos/illumos-gate@a3b2868063897ff0083dea538f55f9873eec981f

https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.

This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

325931 17-Nov-2017 avg

MFC r325610: MFV r325609: 7531 Assign correct flags to prefetched buffers

325541 08-Nov-2017 avg

MFC r324195: MFV r323795: 8604 Avoid unnecessary work search in VFS when unmounting snapshots

325538 08-Nov-2017 avg

MFC r324197: MFV r323913: 8600 ZFS channel programs - snapshot

325537 08-Nov-2017 avg

MFC r324196: MFV r323912: 8592 ZFS channel programs - rollback

325534 08-Nov-2017 avg

MFC r324163: MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups

Also MFC-ed are the following fixes:
- r324164 Add several new files to the files enabled by ZFS kernel option
- r324178 unbreak kernel builds on sparc64 and powerpc
- r324194 fix incorrect use of getzfsvfs_impl in r324163
- r324292 really unbreak kernel builds on sparc64 and powerpc64


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs-program.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_impl.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/kernel.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h
/freebsd-11-stable/cddl/lib/libzpool/Makefile
/freebsd-11-stable/cddl/sbin/zfs/Makefile
/freebsd-11-stable/sys/cddl/compat/opensolaris/kern/opensolaris_sunddi.c
/freebsd-11-stable/sys/cddl/compat/opensolaris/sys/sunddi.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/Makefile.files
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua/lstrlib.c
dsl_dataset.h
dsl_destroy.h
dsl_dir.h
zcp.h
zcp_global.h
zcp_iter.h
zcp_prop.h
zfs_ioctl.h
zfs_vfsops.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_get.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_global.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_iter.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zcp_synctask.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
/freebsd-11-stable/sys/conf/kern.pre.mk
/freebsd-11-stable/sys/modules/zfs/Makefile
325153 30-Oct-2017 avg

MFC r324349: MFV r322235: 8067 zdb should be able to dump literal embedded block pointer

325132 30-Oct-2017 avg

MFC r324011, r324016: MFV r323535: 8585 improve batching done in zil_commit()

FreeBSD notes:
- this MFV reverts FreeBSD commit r314549 to make the merge easier
- at present our emulation of cv_timedwait_hires is rather poor,
so I elected to use cv_timedwait_sbt directly
Please see the differential revision for details.
Unfortunately, I did not get any positive reviews, so there could be
bugs in the FreeBSD-specific piece of the merge.
Hence, the long MFC timeout.

illumos/illumos-gate@1271e4b10dfaaed576c08a812f466f6e81370e5e
https://github.com/illumos/illumos-gate/commit/1271e4b10dfaaed576c08a812f466f6e81370e5e

https://www.illumos.org/issues/8585
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

324205 02-Oct-2017 avg

MFC r323433,r323793,r323915: MFV r323110: 8558 lwp_create() returns EAGAIN
on system with more than 80K ZFS filesystems, and followups

r323433: MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems
r323793: MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure
r323915: MFV r323914: 8661 remove "zil-cw2" dtrace probe

324010 26-Sep-2017 avg

MFC r323355: MFV r323107: 8414 Implemented zpool scrub pause/resume

illumos/illumos-gate@1702cce751c5cb7ead878d0205a6c90b027e3de8
https://github.com/illumos/illumos-gate/commit/1702cce751c5cb7ead878d0205a6c90b027e3de8

FreeBSD note: rather than merging the zpool.8 update I copied the zpool
scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost
verbatim. Now that the illumos page uses the mdoc format, it was an
easier option. Perhaps the change is not in perfect compliance with the
FreeBSD style, but I think that it is acceptible.

https://www.illumos.org/issues/8414
This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
Currently, there is no way to pause a scrub. Pausing may be useful when
the pool is busy with other I/O to preserve bandwidth.

Description

This patch adds the ability to pause and resume scrubbing. This is achieved
by maintaining a persistent on-disk scrub state. While the state is 'paused'
we do not scrub any more blocks. We do however perform regular scan
housekeeping such as freeing async destroyed and deadlist blocks while paused.

Motivation and Context

Scrub pausing can be an I/O intensive operation and people have been asking
for the ability to pause a scrub for a while. This allows one to preserve scrub
progress while freeing up bandwidth for other I/O.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>

324004 26-Sep-2017 avg

MFC r323479,r323491: zfs: tighten debug versions of ZTOV and VTOZ

Sponsored by: Panzura

323758 19-Sep-2017 avg

MFC r322241: MFV r322240: 8491 uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS

illumos/illumos-gate@79c2b812ee2010ebf20fdd92dc5f06b59000a94c
https://github.com/illumos/illumos-gate/commit/79c2b812ee2010ebf20fdd92dc5f06b59000a94c

https://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

323757 19-Sep-2017 avg

MFC r322230: MFV r322229: 7600 zfs rollback should pass target snapshot to kernel

illumos/illumos-gate@77b171372ed21642e04c873ef1e87fe2365520df
https://github.com/illumos/illumos-gate/commit/77b171372ed21642e04c873ef1e87fe2365520df

https://www.illumos.org/issues/7600
At present, the kernel side code seems to blindly rollback to whatever happens
to be the latest snapshot at the time when the rollback task is processed.
The expected target's name should be passed to the kernel driver and the sync
task should validate that the target exists and that it is the latest snapshot
indeed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323749 19-Sep-2017 avg

MFC r322233: MFV r322232: 8426 mark immutable buffer arguments as such in abd.h

illumos/illumos-gate@9b195260e22529ac0e2580faaf89402420589c1c
https://github.com/illumos/illumos-gate/commit/9b195260e22529ac0e2580faaf89402420589c1c

https://www.illumos.org/issues/8426
abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
qualify them with const.
abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
corresponding arguments.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

321614 27-Jul-2017 mav

MFC r320352: zfs: port vdev_file part of illumos change 3306

3306 zdb should be able to issue reads in parallel
illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1
https://www.illumos.org/issues/3306

The upstream change was made before we started to import upstream commits
individually. It was imported into the illumos vendor area as r242733.
That commit was MFV-ed in r260138, but as the commit message says
vdev_file.c was left intact.

This commit actually implements the parallel I/O for vdev_file using a
taskqueue with multiple thread. This implementation does not depend on
the illumos or FreeBSD bio interface at all, but uses zio_t to pass
around all the relevent data. So, the code looks a bit different from
the upstream.

This commit also incorporates ZoL commit
zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed
https://github.com/zfsonlinux/zfs/issues/2270
We need to use a dedicated taskqueue for exactly the same reason as ZoL
as we do not implement TASKQ_DYNAMIC.

Obtained from: illumos, ZFS on Linux

321611 27-Jul-2017 mav

MFC r320237: MFV r318947: 7578 Fix/improve some aspects of ZIL writing.

FreeBSD note: this commit removes small differences between what mav
committed to FreeBSD in r308782 and what ended up committed to illumos
after addressing all review comments.

illumos/illumos-gate@c5ee46810f82e8a53d2cc5a487568a573f449039
https://github.com/illumos/illumos-gate/commit/c5ee46810f82e8a53d2cc5a487568a573f449039

https://www.illumos.org/issues/7578
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>

321610 27-Jul-2017 mav

MFC r320156, r320185, r320186, r320262, r320452, r321111:
MFV r318946: 8021 ARC buf data scatter-ization

illumos/illumos-gate@770499e185d15678ccb0be57ebc626ad18d93383
https://github.com/illumos/illumos-gate/commit/770499e185d15678ccb0be57ebc626ad1
8d93383

https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linea
r
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)

FreeBSD notes:
- the actual (default) chunk size is 4KB (despite the text above saying 1KB)
- we can try to reimplement ABDs, so that they are not permanently
mapped into the KVA unless explicitly requested, especially on
platforms with scarce KVA
- we can try to use unmapped I/O and avoid intermediate allocation of a
linear, virtual memory mapped buffer
- we can try to avoid extra data copying by referring to chunks / pages
in the original ABD

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb_il.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_fletcher.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_fletcher.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/Makefile.files
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/blkptr.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/ddt.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/edonr_zfs.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lz4.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sha256.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/skein_zfs.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
abd.h
ddt.h
spa.h
vdev_impl.h
zio.h
zio_checksum.h
zio_compress.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_cache.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_checksum.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c
/freebsd-11-stable/sys/conf/files
321609 27-Jul-2017 mav

MFC r320153: revert r315852 which introduced zio_buf_alloc_nowait for use
in vdev_queue_aggregate

I think that the change is still good, but reconciling it with a planned
merge of the ARC buf data scatter-ization is a bit more tedious
than I can handle.

321578 26-Jul-2017 mav

MFC r319949: MFV r319948: 5428 provide fts(), reallocarray(), and strtonum()

illumos/illumos-gate@4585130b259133a26efae68275dbe56b08366deb
https://github.com/illumos/illumos-gate/commit/4585130b259133a26efae68275dbe56b08366deb

https://www.illumos.org/issues/5428

Most of the upstream change is not applicable to FreeBSD.
Only the renaming of strtonum to zfs_strtonum is relevant to us.
And we already had it partially done.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>

321573 26-Jul-2017 mav

MFC r319748: MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers

illumos/illumos-gate@adaec86ad212d9fd756bee322934fa54d1258605
https://github.com/illumos/illumos-gate/commit/adaec86ad212d9fd756bee322934fa54d1258605

https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321570 26-Jul-2017 mav

MFC r318945: MFV r318944: 8265 Reserve send stream flag for large dnode feature

illumos/illumos-gate@bc83969fdbd1cb0d97ba00218c0a3de5c89fba92
https://github.com/illumos/illumos-gate/commit/bc83969fdbd1cb0d97ba00218c0a3de5c
89fba92

https://www.illumos.org/issues/8265
Reserve bit 23 in the zfs send stream flags for the large
dnode feature which has been implemented for Linux.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brian Behlendorf <behlendorf1@llnl.gov>

321567 26-Jul-2017 mav

MFC r318932: MFV r318931: 8063 verify that we do not attempt to access inactive txg

illumos/illumos-gate@b7b2590dd9f11b12a0b4878db3886068cce176af
https://github.com/illumos/illumos-gate/commit/b7b2590dd9f11b12a0b4878db3886068cce176af

https://www.illumos.org/issues/8063
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321554 26-Jul-2017 mav

MFC r318829: MFV r316920: 8023 Panic destroying a metaslab deferred range tree

illumos/illumos-gate@3991b535a8e990c0369be677746a87c259b13e9f
https://github.com/illumos/illumos-gate/commit/3991b535a8e990c0369be677746a87c259b13e9f

https://www.illumos.org/issues/8023
$C
ffffff0011bc0970 vpanic()
ffffff0011bc0a00 strlog()
ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00)
ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380)
ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800)
ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000)
ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0)
ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000)
ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000)
ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190)
ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>

321553 26-Jul-2017 mav

MFC r318828: MFV r316917: 7968 multi-threaded spa_sync()

illumos/illumos-gate@94c2d0eb22e9624151ee84a7edbf7178e1bf4087
https://github.com/illumos/illumos-gate/commit/94c2d0eb22e9624151ee84a7edbf7178e1bf4087

https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321552 26-Jul-2017 mav

MFC r318827: MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists

illumos/illumos-gate@10fbdecb05f411234920f8d3c92c148d39106d7e
https://github.com/illumos/illumos-gate/commit/10fbdecb05f411234920f8d3c92c148d39106d7e

https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321549 26-Jul-2017 mav

MFC r318823: MFC r316914: 7801 add more by-dnode routines

illumos/illumos-gate@b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe
https://github.com/illumos/illumos-gate/commit/b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe

https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from: https://github.com/zfsonlinux/zfs/commit/0eef1bde31d67091d3deed23fe2394f5a8bf2276

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>

321547 26-Jul-2017 mav

MFC r318821: MFV r316912: 7793 ztest fails assertion in dmu_tx_willuse_space

illumos/illumos-gate@61e255ce7267b52208af9daf434b77d37fb75622
https://github.com/illumos/illumos-gate/commit/61e255ce7267b52208af9daf434b77d37
fb75622

https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321545 26-Jul-2017 mav

MFC r318818: MFV r316907: 1300 filename normalization doesn't work for removes

illumos/illumos-gate@1c17160ac558f98048951327f4e9248d8f46acc0
https://github.com/illumos/illumos-gate/commit/1c17160ac558f98048951327f4e9248d8f46acc0

https://www.illumos.org/issues/1300

FreeBSD note: recent FreeBSD was not affected by the issue fixed as the
name cache is completely bypassed when normalization is enabled.
The change is imported for the sake of ZAP infrastructure modifications.

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

321541 26-Jul-2017 mav

MFC r317541: MFV 316905

7740 fix for 6513 only works in hole punching case, not truncation

illumos/illumos-gate@7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553
https://github.com/illumos/illumos-gate/commit/7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553

https://www.illumos.org/issues/7740
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
To verify, create a large file, truncate it to a short length, and then
write beyond the end. Check with zdb to make sure that there are no
holes with birth time zero (will appear as gaps).

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>

321540 26-Jul-2017 mav

MFC r317533: MFV 316900

7743 per-vdev-zaps have no initialize path on upgrade

illumos/illumos-gate@555da5111b0f2552c42d057b211aba89c9c79f6c
https://github.com/illumos/illumos-gate/commit/555da5111b0f2552c42d057b211aba89c9c79f6c

https://www.illumos.org/issues/7743
When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

321539 26-Jul-2017 mav

MFC r317527: MFV 316898

7613 ms_freetree[4] is only used in syncing context

illumos/illumos-gate@5f145778012b555e084eacc858ead9e1e42bd149
https://github.com/illumos/illumos-gate/commit/5f145778012b555e084eacc858ead9e1e42bd149

https://www.illumos.org/issues/7613
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
replace it with two trees: the freeing tree (ranges that we are freeing this
syncing txg) and the freed tree (ranges which have been freed this txg).

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321538 26-Jul-2017 mav

MFC r317522: MFV 316897

7586 remove #ifdef __lint hack from dmu.h

illumos/illumos-gate@4ba5b9616327ef64e8abc737d29b3faabc6ae68c
https://github.com/illumos/illumos-gate/commit/4ba5b9616327ef64e8abc737d29b3faabc6ae68c

https://www.illumos.org/issues/7586
The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
we add other inline functions into header files in ZFS, especially since it is
difficult to make any other solution work across all compilation targets. We
should switch to disabling the lint flags that are failing instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>

321535 26-Jul-2017 mav

MFC r317414: MFV 316894

7252 7628 compressed zfs send / receive

illumos/illumos-gate@5602294fda888d923d57a78bafdaf48ae6223dea
https://github.com/illumos/illumos-gate/commit/5602294fda888d923d57a78bafdaf48ae6223dea

https://www.illumos.org/issues/7252
This feature includes code to allow a system with compressed ARC enabled to
send data in its compressed form straight out of the ARC, and receive data in
its compressed form directly into the ARC.

https://www.illumos.org/issues/7628
We should have longer, more readable versions of the ZFS send / recv options.

7628 create long versions of ZFS send / receive options

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>

321531 26-Jul-2017 mav

MFC r317235: MFV 316868

7430 Backfill metadnode more intelligently

illumos/illumos-gate@af346df58864e8fe897b1ff1a3a4c12f9294391b
https://github.com/illumos/illumos-gate/commit/af346df58864e8fe897b1ff1a3a4c12f9
294391b

https://www.illumos.org/issues/7430
Description and patch from brought over from the following ZoL commit: https:/
/
github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don?t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can?t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.

Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>

321530 26-Jul-2017 mav

MFC r316037: MFV: 315989

7603 xuio_stat_wbuf_* should be declared (void)

illumos/illumos-gate@99aa8b55058e512798eafbd71f72f916bdc10181
https://github.com/illumos/illumos-gate/commit/99aa8b55058e512798eafbd71f72f916bdc10181

https://www.illumos.org/issues/7603

The funcs are declared k&r style, where the args are not specified:

void xuio_stat_wbuf_copied();
They should be declared to take no arguments:

void xuio_stat_wbuf_copied(void);
Need to change both .c and .h.

Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

321529 26-Jul-2017 mav

MFC r315896: MFV r315290, r315291: 7303 dynamic metaslab selection

illumos/illumos-gate@8363e80ae72609660f6090766ca8c2c18aa53f0c
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18

https://www.illumos.org/issues/7303

This change introduces a new weighting algorithm to improve metaslab selection
.
The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a res
ult,
the metaslab weight now encodes the type of weighting algorithm used
(size-based vs segment-based).

This also introduce a new allocation tracing facility and two new dcmds to hel
p
debug allocation problems. Each zio now contains a zio_alloc_list_t structure
that is populated as the zio goes through the allocations stage. Here's an
example of how to use the tracing facility:

> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 ENOSPC ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 ENOSPC ztest.0a
1 0 400 1 x 8M 17b1a00 ztest.0a

> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 200 0 NOT_ALLOCATABLE mirror-2
- 0 200 0 NOT_ALLOCATABLE mirror-0
1 0 200 1 x 4M 112ae00 mirror-1
- 1 200 0 NOT_ALLOCATABLE mirror-2
- 1 200 0 NOT_ALLOCATABLE mirror-0
1 1 200 1 x 4M 112b000 mirror-1
- 2 200 0 NOT_ALLOCATABLE mirror-2

If the metaslab is using segment-based weighting then the WEIGHT column will
display the number of segments available in the bucket where the allocation
attempt was made.

Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

321527 26-Jul-2017 mav

MFC r314267: MFV 314243

6676 Race between unique_insert() and unique_remove() causes ZFS fsid change

illumos/illumos-gate@40510e8eba18690b9a9843b26393725eeb0f1dac
https://github.com/illumos/illumos-gate/commit/40510e8eba18690b9a9843b26393725ee
b0f1dac

https://www.illumos.org/issues/6676

The fsid of zfs filesystems might change after reboot or remount. The problem
seems to
be caused by a race between unique_insert() and unique_remove(). The unique_re
move()
is called from dsl_dataset_evict() which is now an asynchronous thread. In a c
ase the
dsl_dataset_evict() thread is very slow and calls unique_remove() too late we
will end
up with changed fsid on zfs mount.

This problem is very likely caused by #5056.

Steps to Reproduce
Note: I'm able to reproduce this always on a single core (virtual) machine. On
multicore
machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

Impact
The persistent fsid (filesystem id) is essential for proper NFS functionality.
If the fsid of a filesystem changes on remount (or after reboot) the NFS
clients might not be able to automatically recover from such event and the
manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

321524 26-Jul-2017 mav

MFC r313813: MFV 313786

7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid

illumos/illumos-gate@653af1b809998570c7e89fe7a0d3f90992bf0216
https://github.com/illumos/illumos-gate/commit/653af1b809998570c7e89fe7a0d3f90992bf0216

https://www.illumos.org/issues/7500
With the integration of:

commit 0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date: Sat Jul 26 13:40:04 2014 -0800
4873 zvol unmap calls can take a very long time for larger datasets

the dnode's dn_bufs field was changed from a list to a tree. As a result,
the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>

321523 26-Jul-2017 mav

MFC r312535: MFV 312436

6569 large file delete can starve out write ops

illumos/illumos-gate@ff5177ee8bf9a355131ce2cc61ae2da6a5a6fdd6
https://github.com/illumos/illumos-gate/commit/ff5177ee8bf9a355131ce2cc61ae2da
6a5a6fdd6

https://www.illumos.org/issues/6569
The core issue I've found is that there is no throttle for how many
deletes get assigned to one TXG. As a results when deleting large files
we end up filling consecutive TXGs with deletes/frees, then write
throttling other (more important) ops.

There is an easy test case for this problem. Try deleting several
large files (at least 1/2 TB) while you do write ops on the same
pool. What we've seen is performance of these write ops (let's
call it sideload I/O) would drop to zero.

More specifically the problem is that dmu_free_long_range_impl()
can/will fill up all of the dirty data in the pool "instantly",
before many of the sideload ops can get in. So sideload
performance will be impacted until all the files are freed.

The solution we have tested at Nexenta (with positive results)
creates a relatively simple throttle for how many "free" ops we let
into one TXG.

However this solution exposes other problems that should also be
addressed. If we are to slow down freeing of data that means one
has to wait even longer (assuming vnode ref count of 1) to get shell
back after an rm or for NFS thread to finish the free-ing op.
To avoid this the proposed solution is to call zfs_inactive() async
for "large" files. Async freeing then begs for the reclaimed space
to be accounted for in the zpool's "freeing" prop.

The other issue with having a longer delete is the inability to
export/unmount for a longer period of time. The proposed solution
is to interrupt freeing of blocks when a fs is unmounted.

Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

316849 14-Apr-2017 avg

MFC r315852: zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate

315842 23-Mar-2017 avg

MFC r314048,r314194: reimplement zfsctl (.zfs) support

315441 17-Mar-2017 mav

MFC r308782:
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by: iXsystems, Inc.

314665 04-Mar-2017 avg

MFC r314273: zfs: call spa_deadman on a taskqueue thread

310511 24-Dec-2016 avg

MFC r309098: MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try
to free already free blocks

310509 24-Dec-2016 avg

MFC r309097: MFV r308987: 7180 potential race between
zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename

308914 21-Nov-2016 avg

MFC r308089: zfsbootcfg: a simple tool to set next boot (one time)
options for zfsboot

308595 13-Nov-2016 mav

MFC r308173:
Fix ZIL records ordering when ZVOL opened both with and without FSYNC.

Before this an earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.

308082 29-Oct-2016 mav

MFC r306424: MFV r306422:
7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs

dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.

Closes #180

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

307995 27-Oct-2016 avg

MFC r306801: implement zfs_vptocnp() using z_parent property

307299 14-Oct-2016 mav

MFC r305563: MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused

The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:

commit ca0cc3918a1789fa839194af2a9245f801a06b1a
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Fri Jul 24 09:53:55 2015 -0700

5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

This patch simply removes this macro from dsl_dataset.h.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307290 14-Oct-2016 mav

MFC r305340: MFC r305337:
7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object

Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489a169ab36f9ec1b2a111a60a5563a9f

307286 14-Oct-2016 mav

MFC r305338: MFV r305335: 7003 zap_lockdir() should tag hold

zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5a2c13e04328b39ecd2a6d0d3c4f7cb

307284 14-Oct-2016 mav

MFC r305334: MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3b10689fc45f4930d2392f5fe1d9cfa
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
A test failure occurred where a send stream had only a BEGIN record. This
should not be possible if the send returns without error. Prevented this from
happening in the future by adding an assertion to dmu_send_impl() to verify
that if the function returns 0 (success) both a BEGIN and END record are
present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
END records sent), having dump_record() set flags appropriately, adding VERIFY
statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>

307282 14-Oct-2016 mav

MFC r305333: MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr

illumos/illumos-gate@bd56f80007857b960e0981ed0797ad8ec844a96b
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307277 14-Oct-2016 mav

MFC r305331: MFV r304155:
7090 zfs should improve allocation order and throttle allocations

illumos/illumos-gate@0f7643c7376dd69a08acbfc9d1d7d548b10c846a
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
will drive them asynchronously through the allocation stage which can result i
n
blocks being allocated out-of-order. It would be nice to preserve as much of
the logical order as possible.
In addition, the allocations are equally scattered across all top-level VDEVs
but not all top-level VDEVs are created equally. The pipeline should be able t
o
detect devices that are more capable of handling allocations and should
allocate more blocks to those devices. This allows for dynamic allocation
distribution when devices are imbalanced as fuller devices will tend to be
slower than empty devices.
The change includes a new pool-wide allocation queue which would throttle and
order allocations in the ZIO pipeline. The queue would be ordered by issued
time and offset and would provide an initial amount of allocation of work to
each top-level vdev. The allocation logic utilizes a reservation system to
reserve allocations that will be performed by the allocator. Once an allocatio
n
is successfully completed it's scheduled on a given top-level vdev. Each top-
level vdev maintains a maximum number of allocations that it can handle
(mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
groups and round robin across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>

307265 14-Oct-2016 mav

MFC r305323: MFV r302991: 6950 ARC should cache compressed data

illumos/illumos-gate@dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2
https://github.com/illumos/illumos-gate/commit/dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2

https://www.illumos.org/issues/6950
When reading compressed data from disk, the ARC should keep the compressed
block cached and only decompress it when consumers access the block. The
uncompressed data should be short-lived allowing the ARC to cache a much larger
amount of data. The DMU would also maintain a smaller cache of uncompressed
blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>

307112 12-Oct-2016 mav

MFC r305222: MFV r302993: 7104 increase indirect block size

illumos/illumos-gate@4b5c8e93cab28d3c65ba9d407fd8f46e3be1db1c
https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c

https://www.illumos.org/issues/7104
The current default indirect block size is 16KB. We can improve
performance by increasing it to 128KB. This is especially helpful for
any workload that needs to read most of the metadata, e.g.
scrub/resilver, file deletion, filesystem deletion, and zfs send.
We also need to fix a few space estimation errors to make the tests
pass.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307108 12-Oct-2016 mav

MFC r305209: MFV r302660: 6314 buffer overflow in dsl_dataset_name

illumos/illumos-gate@9adfa60d484ce2435f5af77cc99dcd4e692b6660
https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6
92b6660

https://www.illumos.org/issues/6314
Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zhack/zhack.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_changelist.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_diff.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_impl.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_iter.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/usr.sbin/zfsd/tests/zfsd_unittest.cc
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_namecheck.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c
dmu.h
dsl_dataset.h
dsl_dir.h
spa_impl.h
zap.h
zfs_znode.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h
307101 12-Oct-2016 mav

MFC r305197: MFV r302646:
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch

illumos/illumos-gate@ea4a67f462de0a39a9adea8197bcdef849de5371
https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84
9de5371

https://www.illumos.org/issues/6980
doing zfs send -i snap1 snap2 >testfile results in
internal error: Invalid argument
Abort (core dumped)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307049 11-Oct-2016 mav

MFC r305200: MFV r302651:
7054 dmu_tx_hold_t should use refcount_t to track space

illumos/illumos-gate@0c779ad424a92a84d1e07d47cab7f8009189202b
https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009
189202b

https://www.illumos.org/issues/7054
upstream:
ee0003de7d3e598499be7ac3fe6b61efcc47cb7f
DLPX-40399 dmu_tx_hold_t should use refcount_t to track space

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307046 11-Oct-2016 mav

MFC r305195: MFV r302643:
6902 speed up listing of snapshots if requesting name only and sorting by name

This was our change from the beginning, so just reduce the upstream diff.

304138 15-Aug-2016 avg

MFC r302838: 6513 partially filled holes lose birth time

304137 15-Aug-2016 avg

MFC r302837: 6844 dnode_next_offset can detect fictional holes

303970 11-Aug-2016 avg

MFC r303763,303791,303869: zfs: honour and make use of vfs vnode locking protocol

ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Most importantly many things that FreeBSD VFS
manages on behalf of all filesystems are implemented in ZPL in a different
way.
Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS
functionality or, in the worst cases, badly interacts / interferes
with VFS.

The most prominent problem is a deadlock caused by the lock order reversal
of vnode locks that may happen with concurrent zfs_rename() and lookup().
The deadlock is a result of zfs_rename() not observing the vnode locking
contract expected by VFS.

This commit removes all ZPL internal locking that protects parent-child
relationships of filesystem nodes. These relationships are protected
by vnode locks and the code is changed to take advantage of that fact
and to properly interact with VFS.

Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
use TXG_WAIT mode.

Another victim, disputable perhaps, is ZFS support for filesystems with
mixed case sensitivity. That support is not provided by the OS anyway,
so in ZFS it was a buch of dead code.

To do:
- replace ZFS_ENTER mechanism with VFS managed / visible mechanism
- replace zfs_zget with zfs_vget[f] as much as possible
- get rid of not really useful now zfs_freebsd_* adapters
- more cleanups of unneeded / unused code
- fix / replace .zfs support

PR: 209158
Approved by: re (gjb)

302408 08-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
301010 31-May-2016 allanjude

Connect the SHA-512t256 and Skein hashing algorithms to ZFS

Support for the new hashing algorithms in ZFS was introduced in r289422
However it was disconnected because FreeBSD lacked implementations of
SHA-512 (truncated to 256 bits), and Skein.

These implementations were introduced in r300921 and r300966 respectively

This commit connects them to ZFS and enabled these new checksum algorithms

This new algorithms are not supported by the boot blocks, so do not use them
on your root dataset if you boot from ZFS.

Relnotes: yes
Sponsored by: ScaleEngine Inc.


300132 18-May-2016 avg

zfsctl: tighten an assertion and remove an unused definition

There are only two entries under .zfs and 'shares' has an ID of a
special persistent object in its filesystem.

MFC after: 1 week


299441 11-May-2016 mav

MFV r299440: 6736 ZFS per-vdev ZAPs

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Joe Stein <joe.stein@delphix.com>

openzfs/openzfs@215198a6ad15cf4832370e2f19247abeb36b951a


297832 11-Apr-2016 mav

MFV r297831: 6322 ZFS indirect block predictive prefetch

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Author: Alexander Motin <mav@FreeBSD.org>

Improve speculative prefetch of indirect blocks.

Scalability of many operations on wide ZFS pool can be limited by
requirement to prefetch indirect blocks first. Recently added
asynchronous indirect block read partially helped, but did not
solve the problem completely. This patch extends existing prefetcher
functionality to explicitly work with indirect blocks.

Before this change prefetcher issued reads for up to 8MB of data in
advance. With this change it also issues indirect block reads
for up to 64MB of data in advance, so that when it will be time to
actually read those data, it can be done immediately. Alike effect
can be achieved by just increasing maximal data prefetch distance,
but at higher memory cost.

Also this change introduces indirect block prefetch for rewrite
operations, that was never done before. Previously ARC miss for
Indirect blocks regularly blocked rewrites, converting perfectly
aligned asynchronous operations into synchronous read-write pairs,
significantly reducing maximal rewrite speed.

While being there this issue was also fixed:
- prefetch was done always, even if caching for the dataset was
completely disabled.

Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool
of 12 assorted HDDs shown me such performance numbers:
------- BEFORE --------
Write 491363677 bytes/sec
Read 312430631 bytes/sec
Rewrite 97680464 bytes/sec
-------- AFTER --------
Write 493524146 bytes/sec
Read 438598079 bytes/sec
Rewrite 277506044 bytes/sec

Closes #65
Closes #80

openzfs/openzfs@792fd28ac04f78cc5e43ead2d72a96f244ea84e8


296519 08-Mar-2016 mav

MFV r296518: 5027 zfs large block support (add copyright)

Author: Matthew Ahrens <matt@mahrens.org>

illumos/illumos-gate@c3d26abc9ee97b4f60233556aadeb57e0bd30bb9


296516 08-Mar-2016 mav

MFV r296515: 6536 zfs send: want a way to disable setting of
DRR_FLAG_FREERECORDS

Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andrew Stormont <astormont@racktopsystems.com>

illumos/illumos-gate@880094b6062aebeec8eda6a8651757611c83b13e


296510 08-Mar-2016 mav

MFV r296505: 6531 Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@97e81309571898df9fdd94aab1216dfcf23e060b


294815 26-Jan-2016 mav

MFV r294814: 6393 zfs receive a full send as a clone

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>

illumos/illumos-gate@68ecb2ec930c4b0f00acaf8e0abb2b19c4b8b76f

This allows to do a full (non-incremental send) and receive it as a clone
of an existing dataset. It can leverage nopwrite to share blocks with the
origin. This can be used to change the relationship of datasets on the
target. For example, maybe on the source you have:

A ---- B ---- C

And you have sent to the target a full of B, and the incremental B->C:

B ---- C

You later realize that you want to have A on the target. You will have to
do a full send of A, but nopwrite can save you space on the target if you
receive it as a clone of B, assuming that A and B have some blocks inxi
common:

B ---- C
\
A


294811 26-Jan-2016 mav

MFV r294810: 6414 vdev_config_sync could be simpler

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Will Andrews <will@firepipe.net>

illumos/illumos-gate@eb5bb58421f46cee79155a55688e6c675e7dd361


294329 19-Jan-2016 asomers

Disallow zvol-backed ZFS pools

Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.

Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.

Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.

Reviewed by: delphij
MFC after: 1 week
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4986


289562 19-Oct-2015 mav

MFV r289561: 6328 Fix cstyle errors in zfs codebase

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

illumos/illumos-gate@9a686fbc186e8e2a64e9a5094d44c7d6fa0ea167


289422 16-Oct-2015 mav

MFV r289310:
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@45818ee124adeaaf947698996b4f4c722afc6d1f

This is only a partial merge of respective ZFS infrastructure changes.
At this moment FreeBSD kernel has no those crypto algorithms, so the
parts of the code to enable them are commented out. When they are
implemented, it will be trivial to plug them in.


289362 15-Oct-2015 mav

MFV r289312: 2605 want to resume interrupted zfs send

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@9c3fd1216fa7fb02cfbc78a2518a686d54b48ab8

For more info, see:
- slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive
- video https://www.youtube.com/watch?v=iY44jPMvxog
- manpage changes (for zfs resume -s and zfs send -t)
- upcoming talk at the OpenZFS Developer Summit

The TL;DR is:
Use "zfs receive -s" to save the partially received state on failure.
On failure, get the receive token with "zfs get receive_resume_token <fs>"
Resume the send with "zfs send -t <token_value>"

Relnotes: yes


289309 14-Oct-2015 mav

MFV r289308: 6267 dn_bonus evicted too early

Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Justin T. Gibbs <gibbs@FreeBSD.org>

illumos/illumos-gate@d2058105c61ec61df3a2dd3f839fed8c3fe7bfd6


288204 25-Sep-2015 delphij

MFV r288063: make dataset property de-registration operation O(1)

A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of the live dataset. A property
change is effected by doing a traversal of the tree of dsl directories
for a pool, starting at the directory sourcing the change, and invoking
these callbacks.

The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration for a property is
O(1) (insert into a list), de-registration is O(n) (search list and then
remove). The 'n' for de-registration, however, is not limited to the size
(number of snapshots + 1) of the dsl directory. The eviction portion
of the life cycle for the in core state of datasets is asynchronous,
which allows multiple copies of the dataset information to be in-core
at once. Only one of these copies is active at any time with the rest
going through tear down processing, but all copies contribute to the
cost of performing a dsl_prop_unregister().

One way to create multiple, in-flight copies of dataset information
is by performing "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and then
evicted when reference counts drops to zero. For datasets that are not
mounted, there is no persistent reference count to keep them resident.
So, a list operation will load them, compute the information required to
do the list operation, and then evict them. When performing this operation
from multiple threads it is possible that some of the in-core dataset
information will be reused, but also possible to lose the race and load
the dataset again, even while the same information is being torn down.

Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.

The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h:
Associate dsl property callback records with both the
dsl directory and the dsl dataset that is registering the
callback. Both connections are protected by the dsl directory's
"dd_lock".

When linking callbacks into a dsl directory, group them by
the property type. This helps reduce the space penalty for the
double association (the property name pointer is stored once
per dsl_dir instead of in each record) and reduces the number of
strcmp() calls required to do callback processing when updating
a single property. Property types are stored in a linked list
since currently ZFS registers a maximum of 10 property types
for each dataset.

Note that the property buckets/records associated with a dsl
directory are created on demand, but only freed when the dsl
directory is freed. Given the static nature of property types
and their small number, there is no benefit to freeing the few
bytes of memory used to represent the property record earlier.
When a property record becomes empty, the dsl directory is either
going to become unreferenced a little later in this thread of
execution, or there is a high chance that another dataset is
going to be loaded that would recreate the bucket anyway.

Replace dsl_prop_unregister() with dsl_prop_unregister_all().
All callers of dsl_prop_unregister() are trying to remove
all property registrations for a given dsl dataset anyway. By
changing the API, we can avoid doing any lookups of callbacks
by property type and just traverse the list of all callbacks
for the dataset and free each one.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
Replace use of dsl_prop_unregister() with the new
dsl_prop_unregister_all() API.

illumos/illumos-gate@03bad06fbb261fd4a7151a70dfeff2f5041cce1f
Author: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Illumos issue:
6171 dsl_prop_unregister() slows down dataset eviction
https://www.illumos.org/issues/6171

MFC after: 2 weeks


287706 12-Sep-2015 delphij

MFV r287699: 6214 zpools going south

In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced. This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.

Prevent this by reintroducing b_compress

illumos/illumos-gate@d4cd038c92c36fd0ae35945831a8fc2975b5272c

Illumos issues:

6214 zpools going south
https://www.illumos.org/issues/6214


287702 12-Sep-2015 delphij

MFV r287624: 5987 zfs prefetch code needs work

Rewrite the ZFS prefetch code to detect only forward, sequential
streams.

The following kstats have been added:

kstat.zfs.misc.arcstats.sync_wait_for_async

How many sync reads have waited for async read
to complete. (less is better)

kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch

How many demand read didn't have to wait for I/O
because of predictive prefetch. (more is better)

zfetch kstats have been similified to hits, misses, and max_streams,
with max_streams representing times when we were not able to create
new stream because we already have the maximum number of sequences
for a file.

The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been
replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes
to prefetch per stream.

illumos/illumos-gate@cf6106c8a0d6598b045811f9650d66e07eb332af

Illumos ZFS issues:

5987 zfs prefetch code needs work
https://www.illumos.org/issues/5987


287103 24-Aug-2015 avg

MFV (partial) r286889: 5692 expose the number of hole blocks in a file

FreeBSD porting notes:
- only kernel-side changes are merged
- the new ioctl is not actually implemented yet
- thus, the goal is to synchronize DMU code

illumos/illumos-gate@2bcf0248e992f292c7b814458bcdce2f004925d6

https://www.illumos.org/issues/5692
we would like to expose the number of hole (sparse) blocks in a file.
this can be useful to for example if you want to fill in the holes with
some data; knowing the number of holes in advances allows you to report
progress on hole filling. We could use SEEK_HOLE to do that but it would
be O(n) where n is the number of holes present in the file.

Author: Max Grossman <max.grossman@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>


286763 14-Aug-2015 mav

MFV r277431: 5497 lock contention on arcs_mtx

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@244781f10dcd82684fd8163c016540667842f203

This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.


286708 13-Aug-2015 mav

MFV 286707: 5959 clean up per-dataset feature count code

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@ca0cc3918a1789fa839194af2a9245f801a06b1a

A ZFS feature flags (large blocks) tracks its refcounts as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.


286705 12-Aug-2015 mav

MFV r286704: 5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().

Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.


286689 12-Aug-2015 mav

MFV r284763: 5981 Deadlock in dmu_objset_find_dp

illumos/illumos-gate@1d3f896f5469c69c1339890ec3d68e9feddb0343

https://www.illumos.org/issues/5981
When dmu_objset_find_dp gets called with a read lock held, it fans out
the work to the task queue. Each task in turn acquires its own read
lock before calling the callback. If during this process anyone tries
to a acquire a write lock, it will stall all read lock requests.Thus
the tasks will never finish, the read lock of the caller will never
get freed and the write lock never acquired. deadlock.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <jansen@webgods.de>


286686 12-Aug-2015 mav

MFV r284762: 5269 zpool import slow

illumos/illumos-gate@12380e1e701fda28c9e9f32d01cafb54af279eb5

https://www.illumos.org/issues/5269
When importing a pool (at boot or with zpool import) with many
filesystem, the process can take minutes. It doesn't matter whether
the pool has been exported cleanly or uncleanly. The problem is that
each dataset has its own log chain. On import, all datasets have to be
checked if there are logs to replay. The idea is to speed up this
process by paralellizing it.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Arne Jansen <jansen@webgods.de>


286683 12-Aug-2015 mav

MFV r286682: 5765 add support for estimating send stream size with
lzc_send_space when source is a bookmark

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
Author: Max Grossman <max.grossman@delphix.com>

illumos/illumos-gate@643da460c8ca583e39ce053081754e24087f84c8


286677 12-Aug-2015 mav

MFV r286224: 5695 dmu_sync'ed holes do not retain birth time

illumos/illumos-gate@70163ac57e58ace1c5c94dfbe85dca5a974eff36

https://www.illumos.org/issues/5695
In dmu_sync_ready(), a hole block pointer will have it's logical size
explicitly set as it's necessary for replay purposes. To "undo" this,
dmu_sync_done() will zero out any hole that it finds. This becomes a
problem when using the "hole_birth" feature, as this will also wipe out
any birth time that might have happened to be set on the hole.
...
As a fix, the logic to zero out a hole is only applied to old style
holes with a birth time of zero. Holes created with the "hole_birth"
feature enabled will have a non-zero birth time, and will be skipped
(thus preserving the ltime, type, and level information as well).
In addition, zdb was updated to also print the ltime, type, and level
information for these new style holes. Previously, only the logical
birth time would be printed.

Author: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>


286603 10-Aug-2015 mav

MFV 286602: 5810 zdb should print details of bpobj

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@732885fca09e11183dd0ea69aaaab5588fb7dbff


286587 10-Aug-2015 mav

MFV 286586: 5746 more checksumming in zfs send

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@98110f08fa182032082d98be2ddb9391fcd62bf1


286579 10-Aug-2015 mav

MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@fe319232d24f4ae183730a5a24a09423d8ab4429


286575 10-Aug-2015 mav

MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds

Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin Gibbs <justing@spectralogic.com>

illumos/illumos-gate@bc9014e6a81272073b9854d9f65dd59e18d18c35


286574 10-Aug-2015 mav

MFV r277427: 5445 Add more visibility via arcstats; specifically
arc_state_t stats and differentiate between "data" and "metadata"

Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@4076b1bf41cfd9f968a33ed54a7ae76d9e996fe8


286570 10-Aug-2015 mav

MFV r277426: 5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <Chris.Williamson@delphix.com>

illumos/illumos-gate@89c86e32293a30cdd7af530c38b2073fee01411c

Currently, every buffer cached in the L2ARC is accompanied by a 240-byte
header in memory, leading to very high memory consumption when using very
large cache devices. These changes significantly reduce this overhead.

Currently:

L1-only header = 176 bytes
L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr
= 240 bytes

Memory-optimized:

L1-only header = 176 bytes
L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes
L2-only header = 96 bytes + 32 byte checksum = 128 bytes

So overall:

Trunk Optimized
+-----------------+
L1-only | 176 B | 176 B | (same)
+-----------------+
L1 & L2 | 240 B | 208 B | (saved 32 bytes)
+-----------------+
L2-only | 240 B | 128 B | (saved 116 bytes)
+-----------------+

For an average blocksize of 8KB, this means that for the L2ARC, the ratio
of metadata to data has gone down from about 2.92% to 1.56%. For a
'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this
means that we expect a completely full L2ARC to use (1600 GB * 0.0156) /
60GB = 41% of the available memory, down from 78%.


286547 09-Aug-2015 mav

MFV 286546:
5661 ZFS: "compression = on" should use lz4 if feature is enabled

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@db1741f555ec79def5e9846e6bfd132248514ffe


286545 09-Aug-2015 mav

MFV 286544:
5630 stale bonus buffer in recycled dnode_t leads to data corruption

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>


286541 09-Aug-2015 mav

MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@e57a022b8f718889ffa92adbde47a8f08abcdb25


284593 19-Jun-2015 avg

MFV r284412: 5911 ZFS "hangs" while deleting file

Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@46e1baa6cf6d5432f5fd231bb588df8f9570c858

https://www.illumos.org/issues/5911
Sometimes ZFS appears to hang while deleting a file. It is actually
making slow progress at the file deletion, but other operations
(administrative and writes via the data path) "hang" until the file
removal completes, which can take a long time if the file has many
blocks. The deletion (or most of it) happens in a single txg, and the
sync thread spends most of its time reading indirect blocks via this
stack trace:
swtch+0x141()
cv_wait+0x70()
zio_wait+0x5b()
dbuf_read+0x2c0()
free_children+0x50()
free_children+0x12a()
free_children+0x12a()
free_children+0x12a()
dnode_sync_free_range_impl+0xdf()
dnode_sync_free_range+0x52()
range_tree_vacate+0x65()
dnode_sync+0x1d8()
dmu_objset_sync_dnodes+0x77()
dmu_objset_sync+0x19f()
dsl_dataset_sync+0x51()
dsl_pool_sync+0x9a()
spa_sync+0x2ff()
txg_sync_thread+0x21f()
thread_start+8()
One way to reproduce the problem is if we are over the arc_meta_limit,
e.g. because lots of indirect blocks are pinned because we have L0
dbufs under them. It could be that most of the L1 indirects are cached,
in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(),
it will complete very quickly. This allows dmu_free_long_range_impl() to
put many (perhaps all of its) transactions in the same TXG. However,
dmu_free_long_range_impl() calls dnode_evict_dbufs (and
dnode_free_range()), which removes the L0 dbufs, thus reducing the hold
count on the L1 indirect blocks above it, allowing them to be evicted.
Because we are over the arc_meta_limit(), these L1 blocks will be
evicted ASAP. Thus when we get to syncing context, the L1 indirects are
no longer cached and must be read in.

Obtained from: illumos
MFC after: 15 days


284304 12-Jun-2015 avg

MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size

illumos/illumos-gate@81cd5c555f505484180a62ca5a2fbb00d70c57d6

Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 17 days


282475 05-May-2015 avg

zfs: do not hold an extra reference on a root vnode while a filesystem is mounted

At present zfs_domount() acquires a reference on the filesystem's root vnode
and that reference is kept until zfs_umount.
The latter calls vflush(rootrefs = 1) to dispose of the extra reference.

There is no explanation of why that reference is kept - what problem it
solves or what behavior it improves.
Also, that logic is FreeBSD specific.

There is one real problem with that reference, though.
zfs recv -F may receive a full, non-incremental stream to a mounted filesystem.
In that case the received root object is likely to have a different z_gen
attribute value. Because of that, zfs_rezget will leave the previous root znode
and vnode disassociated from the actual object (z_sa_hdl == NULL).
Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode
pair, while the old one will be kept alive by the outstanding reference.
So, the outstanding reference will not actually be for the new root vnode
(or, more precisely, vnodes - because a root vnode may be recycled and a newer
one can be created).
As a result, when vflush(rootrefs = 1) s called there will be two problems:

- a leaked reference on the old root vnode preventing a graceful unmount
- insufficient references on the actual root vnode leading to a crash upon
access to the vnode after it is destroyed by vgone() + vdrop()

The second issue will actually override the first one.

Differential Revision: https://reviews.freebsd.org/D2353
Reviewed by: delphij, kib, smh
MFC after: 17 days


277300 17-Jan-2015 smh

Mechanically convert cddl sun #ifdef's to illumos

Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.

Also do a manual pass to correct the use if #ifdef comments as per style(9)
as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.

MFC after: 1 month
Sponsored by: Multiplay


276069 22-Dec-2014 smh

Fix panic when resizing ZFS zvol's

Resizing a ZFS ZVOL with debug enabled would result in a panic due to
recursion on dp_config_rwlock.

The upstream change "3464 zfs synctask code needs restructuring" changed
zvol_set_volsize to avoid the recursion on dp_config_rwlock, but this was
missed when originally merged in by r248571 due to significant differences
in our codebases in this area.

These changes also relied on bring in changes from upstream:
3557 dumpvp_size is not updated correctly when a dump zvol's size is
changed, which where also not present.

In order to help prevent future issues in this area a direct comparison
and diff minimisation from current upstream version (b515258) of zvol.c.

Differential Revision: https://reviews.freebsd.org/D1302
MFC after: 1 month
X-MFC-With: r276063 & r276066
Sponsored by: Multiplay


275811 15-Dec-2014 delphij

MFV r275783:

Convert ARC flags to use enum. Previously, public flags are defined in
arc.h and private flags are defined in arc.c which can lead to confusion
and programming errors.

Consistently use 'hdr' (when referencing arc_buf_hdr_t) instead of 'buf'
or 'ab' because arc_buf_t are often named 'buf' as well.

Illumos issue:
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme

MFC after: 2 weeks


275782 15-Dec-2014 delphij

MFV r275551:

Remove "dbuf phys" db->db_data pointer aliases.

Use function accessors that cast db->db_data to the appropriate
"phys" type, removing the need for clients of the dmu buf user
API to keep properly typed pointer aliases to db->db_data in order
to conveniently access their data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
In zap_leaf() and zap_leaf_byteswap, now that the pointer alias
field l_phys has been removed, use the db_data field in an on
stack dmu_buf_t to point to the leaf's phys data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
Remove the db_user_data_ptr_ptr field from dbuf and all logic
to maintain it.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
Modify the DMU buf user API to remove the ability to specify
a db_data aliasing pointer (db_user_data_ptr_ptr).

cddl/contrib/opensolaris/cmd/zdb/zdb.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_diff.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
Create and use the new "phys data" accessor functions
dsl_dir_phys(), dsl_dataset_phys(), zap_m_phys(),
zap_f_phys(), and zap_leaf_phys().

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
Remove now unused "phys pointer" aliases to db->db_data
from clients of the DMU buf user API.

Illumos issue:
5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS

MFC after: 2 weeks


275781 15-Dec-2014 delphij

MFV r275550:

In addition to r273158, make the code in spa_sync() that checks if the
current TXG is a no-op TXG less fragile.

Illumos issue:
5347 idle pool may run itself out of space

MFC after: 2 weeks


275740 13-Dec-2014 delphij

MFV r275548:

Verify that the block pointer is structurally valid, before attempting to
read it in. It can only be invalid in the case of a ZFS bug, but this
change will help identify such bugs in a more transparent way, by
panic'ing with a relevant message, rather than indexing off the end of an
array or something.

Illumos issue:
5349 verify that block pointer is plausible before reading

MFC after: 2 weeks


275594 08-Dec-2014 delphij

MFV r275540:

When importing a pool, don't assume that the passed pool configuration
at vdev_load is always vaild. It's possible that a stale configuration
that comes with extra vdevs, where metaslab_init() would fail because
of lower layer returns error.

Change the code to make metaslab_init() handle and return errors from
lower layer and pass it back to upper layer and handle it there.

Illumos issue:
5213 panic in metaslab_init due to space_map_open returning ENXIO

MFC after: 2 weeks


274337 10-Nov-2014 delphij

MFV r274273:

ZFS large block support.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.

Limited safety belt is provided for mounted root filesystem but use
caution is advised.

Illumos issue:
5027 zfs large block support

MFC after: 1 month


274304 09-Nov-2014 delphij

MFV r274272 and diff reduction with upstream.

Illumos issue:
5244 zio pipeline callers should explicitly invoke next stage

Tested with: ztest plus ZFS over GELI configuration
MFC after: 1 month


272810 09-Oct-2014 delphij

MFV r272804:

Refactor the code and stop restore_object from creating two transactions.

Illumos issue:
3693 restore_object uses at least two transactions to restore an object

MFC after: 2 weeks


272809 09-Oct-2014 delphij

MFV r272803:

Illumos issue:
5175 implement dmu_read_uio_dbuf() to improve cached read performance

MFC after: 2 weeks


272598 06-Oct-2014 delphij

MFV r272585:

Split the godfather zio into CPU number's to reduce lock
contention.

Illumos issue:
5176 lock contention on godfather zio

MFC after: 2 weeks


272504 04-Oct-2014 delphij

MFV r272494:

Make space_map_truncate() always do space_map_reallocate(). Without
this, setting space_map_max_blksz would cause panic for existing pool,
as dmu_objset_set_blocksize would fail if the object have multiple blocks.

Illumos issues:
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
spacemap_histogram feature

MFC after: 2 weeks


271754 18-Sep-2014 smh

Remove unused ZFS ARC functions

* arc_data_buf_alloc
* arc_data_buf_free

MFC after: 1 week
Sponsored by: Multiplay


271526 13-Sep-2014 delphij

MFV r271510:

Enforce 4K as smallest indirect block size (previously the smallest
indirect block size was 1K but that was never used).

This makes some space estimates more accurate and uses less memory
for some data structures.

Illumos issue:
5141 zfs minimum indirect block size is 4K

MFC after: 2 weeks


270383 22-Aug-2014 delphij

Instead of using timestamp in the AVL, use the memory address when
comparing.

Illumos issue:
5095 panic when adding a duplicate dbuf to dn_dbufs

MFC after: 3 days


270247 20-Aug-2014 delphij

MFC r270195:

Illumos issue:
5045 use atomic_{inc,dec}_* instead of atomic_add_*

MFC after: 2 weeks


269431 02-Aug-2014 delphij

MFV r269427:

In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.

This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.

Illumos issue:
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>

MFC after: 2 weeks


269407 01-Aug-2014 smh

Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods

This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.

We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.

zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.

Cleanup flow in vdev_geom_io_start while I'm here.

Also fix some cases not using SET_ERROR(..)

MFC after: 2 weeks
X-MFC-With: r265321


269229 29-Jul-2014 delphij

MFV r269223:

Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets

MFC after: 2 weeks


269118 26-Jul-2014 delphij

MFV r269010:

Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric

MFC after: 2 weeks


269086 25-Jul-2014 delphij

As of r268075, the responsibility of rounding up buffer to optimal size have
been transferred from zio_compress_data to its caller. Therefore, passing
the 'minblocksize' down will be a no-op.

Eliminate the parameter to reduce diff against upstream.

MFC after: 2 weeks


268980 22-Jul-2014 delphij

Correct typo introduced with r268855.

MFC after: 10 days
X-MFC with: r268855


268865 19-Jul-2014 delphij

Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.

Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.

Illumos issue:
5008 lock contention (rrw_exit) while running a read only load

MFC after: 2 weeks


268859 18-Jul-2014 delphij

MFV r268851:

When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).

Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting

MFC after: 2 weeks


268858 18-Jul-2014 delphij

MFV r268850:

Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.

The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.

Illumos issue:
4631 zvol_get_stats triggering too many reads

MFC after: 2 weeks


268855 18-Jul-2014 delphij

MFV r268848:

Instead of asserting all zio's be properly aligned, only assert
on the logical ones.

Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.

This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).

While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.

Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe

MFC after: 2 weeks


268473 09-Jul-2014 delphij

MFV r268455:

Use reserved space for ZFS administrative commands.

We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.

Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.

A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.

MFC after: 2 weeks


268464 09-Jul-2014 delphij

MFV r268452:

Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue: 4950 files sometimes can't be removed from a full
filesystem
MFC after: 2 weeks


268123 01-Jul-2014 delphij

MFV r268119:

4914 zfs on-disk bookmark structure should be named *_phys_t

illumos/illumos-gate@7802d7bf98dec568dadf72286893b1fe5abd8602

MFC after: 2 weeks


268079 01-Jul-2014 delphij

MFV r267566:

4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after: 2 weeks


268075 01-Jul-2014 delphij

MFV r267565:

4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

MFC after: 2 weeks


266771 27-May-2014 delphij

MFV r266766:

Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after: 2 weeks


265321 04-May-2014 smh

Use a zio flag to prevent recursion of vdev_queue_io_done which can
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.

This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.

MFC after: 1 month
X-MFC-With: r265152


265152 30-Apr-2014 smh

Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority

The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after: 1 month


265046 28-Apr-2014 smh

Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.

This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.

MFC after: 2 days


264850 24-Apr-2014 smh

Add the ability to set a minimum ashift size for ZFS pool creation or root level
vdev addition.

Change max_auto_ashift sysctl to error when an invalid value is requested instead
of silently limiting it.


264835 23-Apr-2014 delphij

MFV r264829:

3897 zfs filesystem and snapshot limits

MFC after: 2 weeks


264671 18-Apr-2014 delphij

MFV r264668:

4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

illumos/illumos@b6240e830b871f59c22a3918aebb3b36c872edba

MFC after: 2 weeks


264669 18-Apr-2014 delphij

MFV r264666:

4374 dn_free_ranges should use range_tree_t

illumos/illumos-gate@bf16b11e8deb633dd6c4296d46e92399d1582df4

MFC after: 2 weeks


260185 02-Jan-2014 delphij

MFV r260155:

When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark. This would lead to double free of bp's which will lead to
space map corruption.

Instead of tolerating and allowing the corruption, panic immediately.

See Illumos #4390 for more details.

4391 panic system rather than corrupting pool if we hit bug 4390

Illumos/illumos-gate@8b36997aa24d9817807faa4efa301ac9c07a2b78

MFC after: 2 weeks


260183 02-Jan-2014 delphij

MFV r260154 + 260182:

4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools

Illumos/illumos-gate@78f171005391b928aaf1642b3206c534ed644332

MFC after: 2 weeks


260150 01-Jan-2014 delphij

MFV r259170:

4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47bfcd2ad9bf501faec8e75c08095e4f

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.

MFC after: 2 weeks


260141 31-Dec-2013 delphij

MFV r258385:

(Note: this change is not applicable to FreeBSD and the file
is not included in build. It's integrated for completeness).

4128 disks in zpools never go away when pulled

illumos/illumos-gate@39cddb10a31c1c2e66aed69e6871d09caa4c8147

MFC after: 2 weeks


260138 31-Dec-2013 delphij

MFV r242733:

3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message

illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1

FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use. vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.

MFC after: 2 weeks


259813 24-Dec-2013 delphij

MFV r258374:

4171 clean up spa_feature_*() interfaces

4172 implement extensible_dataset feature for use by other zpool
features

illumos/illumos-gate@2acef22db7808606888f8f92715629ff3ba555b9

MFC after: 2 weeks


258745 29-Nov-2013 avg

zfs: add dmu_write_pages variant for freebsd

The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for ZFS userland counterpart.

MFC after: 15 days


258717 28-Nov-2013 avg

MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control

4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab

illumos/illumos-gate@0713e232b7712cd27d99e1e935ebb8d5de61c57d

Note that some tunables have been removed and some new tunables have
been added. Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).

MFC after: 11 days
Sponsored by: HybridCluster [merge]


258634 26-Nov-2013 avg

MFV r258376: 3964 L2ARC should always compress metadata buffers

illumos/illumos-gate@e4be62a2b74a8f09bb669217a1a39eee069b13a1

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258633 26-Nov-2013 avg

MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit

4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold

illumos/illumos-gate@22e30981d82a0b6dc89253596ededafae8655e00

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258632 26-Nov-2013 avg

MFV r255255: 4045 zfs write throttle & i/o scheduler performance work

illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258631 26-Nov-2013 avg

MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot

illumos/illumos-gate@ec94d32216ed5705f5176582355cc311cf848e73

MFC after: 9 days
Sponsored by: HybridCluster [merge]


258630 26-Nov-2013 avg

734 taskq_dispatch_prealloc() desired

943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e3be0c51e86f73096334611ae3a058e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after: 8 days


258137 14-Nov-2013 mav

Introduce allocation cache to store LZ4 compression contexts without kicking
VM subsystem twice for every written record.

Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.

This patch is not really needed on illumos (while not harm either) since
their memory allocator by default uses caching for all requests up to 128K.

Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>


256956 23-Oct-2013 smh

Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay


255750 21-Sep-2013 delphij

MFV r254750:

Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools

MFC after: 1 month
Approved by: re (ZFS blanket)


255437 10-Sep-2013 delphij

MFV r247844 (illumos-gate 13975:ef6409bc370f)

Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by: re (ZFS blanket)


254753 24-Aug-2013 delphij

MFV r254747:

Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive. This is a regression from FreeBSD r253821.

Illumos ZFS issues:
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive


254711 23-Aug-2013 avg

zfs: inline and remove zfs_vnode_lock

It didn't serve any useful purpose, but obscured file and line information
useful for debugging.

MFC after: 5 days
X-MFC with: r254445


254608 21-Aug-2013 gibbs

Add kstat entries for ZFS compression statistics.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
Add module lifetime functions to allocate and teardown
state data.

Report:
- Compression attempts.
- Buffers found to be empty.
- Compression calls that are skipped because
the data length is already less than or
equal to the minimum block length.
- Compression attempts that fail to yield a 12.5%
compression ratio.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
Add calls to the zio_compress.c module's init and fini
functions.

Sponosred by: Spectra Logic Corporation
MFC after: 2 weeks


254591 21-Aug-2013 gibbs

Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.

2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.

Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.

Populate the new fields in the vdev_stat_t structure.

Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.

Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.

In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

For each sub-optimally configured leaf vdev, report configured
and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.

Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.


254587 21-Aug-2013 delphij

MFV r254421:

Illumos ZFS issues:
3996 want a libzfs_core API to rollback to latest snapshot


254112 08-Aug-2013 delphij

MFV r254079:

Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting


253992 06-Aug-2013 mav

Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition. Otherwise it queues free request as
it was before r252840.


253990 06-Aug-2013 mav

Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s). That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.

In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.

Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.


253821 30-Jul-2013 delphij

MFV r253783:

Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
3834 incremental replication of 'holey' file systems is slow

MFC after: 2 weeks


253820 30-Jul-2013 delphij

MFV r253782:

To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does it's
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).

Illumos ZFS issues:
3888 zfs recv -F should destroy any snapshots created since the
incremental source

MFC after: 2 weeks


253819 30-Jul-2013 delphij

MFV r253781 + r253871:

Illumos ZFS issues:
3894 zfs should not allow snapshot of inconsistent dataset

MFC after: 2 weeks


253816 30-Jul-2013 delphij

MFV r253780:

To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
3875 panic in zfs_root() after failed rollback

MFC after: 2 weeks


253643 25-Jul-2013 mav

Following r222950, revert unintentional change cls -> class in argument name
in r245264. Aside from non-uniformity, that again confused C++ compilers.


252840 05-Jul-2013 mm

MFV r252839:

Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).

By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.

Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case

MFC after: 1 week


252431 30-Jun-2013 rmh

Enable kernel-specific code for FreeBSD also on other systems that use
the kernel of FreeBSD.

Reviewed by: pjd


251646 12-Jun-2013 delphij

MFV r251644:

Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)

Illumos ZFS issues:
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing

MFC after: 2 weeks


251636 11-Jun-2013 delphij

MFV r251626:

ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems

MFC after: 2 weeks


251633 11-Jun-2013 delphij

MFV r251622:

ZFS shouldn't ignore errors unmounting snapshots

Illumos ZFS issues:
3744 zfs shouldn't ignore errors unmounting snapshots

MFC after: 2 weeks


251631 11-Jun-2013 delphij

MFV r251620:

ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
3741 zfs comments need cleaner, more consistent style

MFC after: 2 weeks


251629 11-Jun-2013 delphij

MFV r251619:

ZFS needs better comments.

Illumos ZFS issues:
3741 zfs needs better comments

MFC after: 2 weeks


251520 08-Jun-2013 delphij

MFV r251519:

* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

ZFS should proactively evict freed blocks from the cache.

Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:

1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

MFC after: 2 weeks


251478 06-Jun-2013 delphij

MFV r251474:

* Illumos zfs issue #3137 L2ARC compression

Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.

The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.

MFC after: 2 weeks


249921 26-Apr-2013 smh

Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled
Enabled ZFS TRIM by default

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


249858 24-Apr-2013 mm

MFV r249857:

Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.

Illumos ZFS issues:
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock

MFC after: 1 week


249206 06-Apr-2013 mm

MFV r248660:
Merge vendor change - modify time processing in deadman thread.

Illumos ZFS issues:
3618 ::zio dcmd does not show timestamp data

MFC after: 3 weeks


248574 21-Mar-2013 smh

Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

- Free ZIOs are registered with the TXG from the ZIO itself, not the
current SPA syncing TXG (which may be out of date);
- L2ARC are registered with a zero TXG number, as L2ARC has no concept
of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced
TXG, not the currently syncing TXG. Indeed, under extremely unlikely
race conditions, there is a risk we could trim blocks which have been
freed in a TXG that has not finished syncing, resulting in potential
data corruption in case of a crash.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/5b46ad40d9081d75505d6f3bf04ac652445df366
MFC after: 2 weeks


248572 21-Mar-2013 smh

Add TRIM support for L2ARC.

This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc
MFC after: 2 weeks


248571 21-Mar-2013 mm

Merge libzfs_core branch:
includes MFV 238590, 238592, 247580

MFV 238590, 238592:
In the first zfs ioctl restructuring phase, the libzfs_core library was
introduced. It is a new thin library that wraps around kernel ioctl's.
The idea is to provide a forward-compatible way of dealing with new
features. Arguments are passed in nvlists and not random zfs_cmd fields,
new-style ioctls are logged to pool history using a new method of
history logging.

http://blog.delphix.com/matt/2012/01/17/the-future-of-libzfs/

MFV 247580 [1]:
To address issues of several deadlocks and race conditions the locking
code around dsl_dataset was rewritten and the interface to synctasks
was changed.

User-Visible Changes:
"zfs snapshot" can create more arbitrary snapshots at once (atomically)
"zfs destroy" destroys multiple snapshots at once
"zfs recv" has improved performance

Backward Compatibility:
I have extended the compatibility layer to support full backward
compatibility by remapping or rewriting the responsible ioctl arguments.
Old utilities are fully supported by the new kernel module.

Forward Compatibility:
New utilities work with old kernels with the following restrictions:
- creating, destroying, holding and releasing of multiple snapshots
at once is not supported, this includes recursive (-r) commands

Illumos ZFS issues:
2882 implement libzfs_core
2900 "zfs snapshot" should be able to create multiple,
arbitrary snapshots at once
3464 zfs synctask code needs restructuring

References:
https://www.illumos.org/issues/2882
https://www.illumos.org/issues/2900
https://www.illumos.org/issues/3464 [1]

MFC after: 1 month
Sponsored by: Hybrid Logic Inc. [1]


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


247398 27-Feb-2013 mm

MFV 247176, 247178, 247315:
Import metaslab_sync() speedup from vendor (illumos).

Illumos ZFS issues:
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not
condensing)
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()

References:
https://www.illumos.org/issues/3552
https://www.illumos.org/issues/3564
https://www.illumos.org/issues/3578
https://www.illumos.org/issues/3579

MFC after: 2 weeks


247265 25-Feb-2013 mm

MFV v242732:

Merge the ZFS I/O deadman thread from vendor (illumos).
This feature panics the system on hanging ZFS I/O, helps debugging
and resumes failed service.

The panic behavior can be controlled with the loader-only tunables:
vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O)
vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)

By default, ZFS I/O deadman is enabled by default on amd64 and i386
excluding virtual guest machines.

Illumos ZFS issues:
3246 ZFS I/O deadman thread

References:
https://www.illumos.org/issues/3246

MFC after: 2 weeks


246666 11-Feb-2013 mm

MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
https://www.illumos.org/issues/3498

MFC after: 2 weeks


246651 11-Feb-2013 mm

MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after: 2 weeks


246586 09-Feb-2013 delphij

MFV r245512:

* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible. This means once a pool have LZ4 compressed data, these
data can no longer be read by older ZFS implementations.

Local changes:

- On-stack hash table disabled and using kernel slab allocator
instead, at this time. This requires larger kernel thread stack
for zio workers. This may change in the future should we adjusted
the zio workers' thread stack size.
- likely and unlikely will be undefined if they are already defined,
this is required for i386 XEN build.
- Removed De Bruijn sequence based __builtin_ctz family of builtins
in favor of the latter. Both GCC and clang supports these builtins.
- Changed the way the LZ4 code detects endianness.
- Manual pages modifications to mention the feature based on Illumos
counterpart.
- Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from: Illumos (13921:9d721847e469)
Tested on: FreeBSD/amd64
MFC after: 1 month


246531 08-Feb-2013 avg

zfs: update comments about zfid_long_t to match the FreeBSD definitions

MFC after: 1 week


245264 10-Jan-2013 delphij

The current ZFS code expects ddt_zap_count to always succeed by asserting
the underlying zap_count() to return no errors. However, it is possible
that the pool reaches to such a state where zap_count would return error,
leading to panics when a pool is imported.

This commit changes the ddt_zap_count to return error returned from
zap_count and handle the error appropriately. With this change, it's now
possible to let zpool rollback damaged transaction groups and import the
pool.

Obtained from: ZFS on Linux github (e8fd45a0f975c6b8ae8cd644714fc21f14fac2bf)
MFC after: 1 month


244155 12-Dec-2012 smh

Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR: kern/173113
Submitted by: Steven Hartland
Reviewed by: pjd
Approved by: pjd
MFC after: 2 weeks


243524 25-Nov-2012 mm

MFV r243013 and r243267:

Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after: 2 weeks


243520 25-Nov-2012 avg

zfs: overhaul zfs-vfs glue for vnode life-cycle management

* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model. All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit. This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim. This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start. This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested. Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 6 weeks


243503 25-Nov-2012 mm

MFV r242735:

Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after: 2 weeks


242845 10-Nov-2012 delphij

MFV r242729 (mm):

Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable


241286 06-Oct-2012 avg

zfs_mount: taste geom providers for root pool config

This should allow to mount a dataset as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.

If the root filesystem's pool is found in zpool.cache, the by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.

Discussed with: gibbs, pjd, des
MFC after: 25 days


240955 26-Sep-2012 mm

Merge recent vendor changes in ZFS.

Illumos issued covered:
2811 missing implementation: zfs send -r
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting

References:
https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


240868 23-Sep-2012 pjd

Add TRIM support.

The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something will be reordered and write will reached the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by: multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from: zfsonlinux
Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>


240631 18-Sep-2012 avg

zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line

Discussed with: pjd
MFC after: 10 days


240133 05-Sep-2012 mm

Merge recent vendor changes and sync code:
1862 incremental zfs receive fails for sparse file > 8PB
3112 ztest does not honor ZFS_DEBUG
3122 zfs destroy filesystem should prefetch blocks
3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

References:
https://www.illumos.org/issues/1862
https://www.illumos.org/issues/3112
https://www.illumos.org/issues/3122
https://www.illumos.org/issues/3129
https://www.illumos.org/issues/3130

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239774 28-Aug-2012 mm

Merge recent vendor changes:
3100 zvol rename fails with EBUSY when dirty
3104 eliminate empty bpobjs
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

References:
https://www.illumos.org/issues/3100
https://www.illumos.org/issues/3104
https://www.illumos.org/issues/3120

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239620 23-Aug-2012 mm

Merge recent vendor changes:
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Referenes:
https://www.illumos.org/issues/3086
https://www.illumos.org/issues/3090
https://www.illumos.org/issues/3102

PR: kern/170912, kern/170914
Obtained from: illumos (changeset #13776, #13777)
MFC after: 2 weeks


238926 30-Jul-2012 mm

Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags

References:
https://www.illumos.org/issues/2762

MFC after: 2 weeks


238113 04-Jul-2012 pjd

vdev_io_done stage is not used for ioctls.

MFC after: 1 week


236884 11-Jun-2012 mm

Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags

References:
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747

Obtained from: illumos (issue #2619, #2747)
MFC after: 1 month


236155 27-May-2012 mm

Import illumos changeset 13570:3411fd5f1589
1948 zpool list should show more detailed pool information

Display per-vdev information with "zpool list -v".
The added expandsize property has currently no value on FreeBSD.
This changeset allows adding expansion support to individual vdevs
in the future.

References:
https://www.illumos.org/issues/1948

Obtained from: illumos (issue #1948)
MFC after: 2 weeks


235222 10-May-2012 mm

Import illumos changeset 13686:4bc0783f6064
2703 add mechanism to report ZFS send progress

If the zfs send command is used with the -v flag, the amount of bytes
transmitted is reported in per second updates.

References:
https://www.illumos.org/issues/2703

Obtained from: illumos (issue #2703)
MFC after: 2 weeks


230514 24-Jan-2012 mm

Merge illumos revisions 13572, 13573, 13574:

Rev. 13572:
disk sync write perf regression when slog is used post oi_148 [1]

Rev. 13573:
crash during reguid causes stale config [2]
allow and unallow missing from zpool history since removal of pyzfs [5]

Rev. 13574:
leaking a vdev when removing an l2cache device [3]
memory leak when adding a file-based l2arc device [4]
leak in ZFS from metaslab_group_create and zfs_ereport_checksum [6]

References:
https://www.illumos.org/issues/1909 [1]
https://www.illumos.org/issues/1949 [2]
https://www.illumos.org/issues/1951 [3]
https://www.illumos.org/issues/1952 [4]
https://www.illumos.org/issues/1953 [5]
https://www.illumos.org/issues/1954 [6]

Obtained from: illumos (issues #1909, #1949, #1951, #1952, #1953, #1954)
MFC after: 2 weeks


230438 21-Jan-2012 pjd

Dramatically optimize listing snapshots when user requests only snapshot
names and wants to sort them by name, ie. when executes:

# zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot properties.

Below you can find how long does it take to list 34509 snapshots from a single
disk pool before and after this change with cold and warm cache:

before:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s

after:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s

MFC after: 1 week


228103 28-Nov-2011 mm

Merge new ZFS features from illumos:

1644 add ZFS "clones" property
https://www.illumos.org/issues/1644

1645 add ZFS "written" and "written@..." properties
https://www.illumos.org/issues/1645

1646 "zfs send" should estimate size of stream
https://www.illumos.org/issues/1646

1647 "zfs destroy" should determine space reclaimed by destroying multiple
snapshots
https://www.illumos.org/issues/1647

1693 persistent 'comment' field for a zpool
https://www.illumos.org/issues/1693

1708 adjust size of zpool history data
https://www.illumos.org/issues/1708

1748 desire support for reguid in zfs
https://www.illumos.org/issues/1748

Obtained from: illumos (changesets 13514, 13524, 13525)
MFC after: 1 month


226707 24-Oct-2011 pjd

- Use better naming now that we allow to rename any mounted file system (not
only legacy).
- Update copyright to include myself.

MFC after: 2 weeks


226676 24-Oct-2011 pjd

Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' preperty set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.

This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.

MFC after: 2 weeks


224791 12-Aug-2011 pjd

Eliminate the zfsdev_state_lock entirely and replace it with the
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.

Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week


224251 21-Jul-2011 delphij

A different implementation of r224231 proposed by pjd@,
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().

Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)


224231 20-Jul-2011 delphij

Add a new field to in-core znode, z_rdev, to represent device nodes.

PR: kern/159010
Reviewed by: mm@
Approved by: re (kib)
MFC after: 2 weeks


224177 18-Jul-2011 mm

ZFS tries to allocate blocks evenly across all devices. This means when
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.

New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)

Illumos-gate changeset: 13379:4df42cc92254

Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks


224174 18-Jul-2011 mm

Resurrect the ZFS "aclmode" property
Change default of "aclmode" to "discard".

Illumos-gate changeset: 13370:8c04143bd318

Obtained from: Illumos (Feature #742)
MFC after: 2 weeks


222950 10-Jun-2011 gibbs

Remove C constructs that are incompatible with C++ from various
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.

Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.

cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.

sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.

Sponsored by: Spectra Logic Corporation


222267 24-May-2011 pjd

Don't access task structure once we call task function.
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.

Submitted by: anonymous
MFC after: 2 weeks


221263 30-Apr-2011 mm

Fix deduplicated zfs receive
(dmu_recv_stream builds incomplete guid_to_ds_map)

Illumos-gate changeset: 13329:c48b8bf84ab7
MFC together with v28

Approved by: pjd
Obtained from: Illumos (Bug #755)


219320 06-Mar-2011 pjd

Fix libzpool build.

MFC after: 1 month


219317 05-Mar-2011 pjd

Make renaming of a ZVOL, ZVOL's parent directory and ZVOL snapshot work.

Reported by: avg
MFC after: 1 month


219089 27-Feb-2011 pjd

Finally... Import the latest open-source ZFS version - (SPA) 28.

Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.

MFC after: 1 month


216919 03-Jan-2011 mm

MFp4 r186485, r186859:

Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.

Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days


213198 27-Sep-2010 mm

Properly handle IO with B_FAILFAST
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted

OpenSolaris revision and Bug IDs:

9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843014)
MFC after: 3 weeks


213197 27-Sep-2010 mm

Enable offlining of log devices.

OpenSolaris revision and Bug IDs:

9701:cc5b64682e64
6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after: 3 weeks


212694 15-Sep-2010 mm

Fix kernel panic when moving a file to .zfs/shares
Fix possible loss of correct error return code in ZFS mount

OpenSolaris revisions and Bug IDs:

11824:53128e5db7cf
6863610 ZFS mount can lose correct error return

12079:13822b941977
6939941 problem with moving files in zfs (142901-12)

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days


211932 28-Aug-2010 mm

Import changes from OpenSolaris that provide
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes

Detailed information (OpenSolaris onnv changesets and Bug IDs):

9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer

9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership

9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters

10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission

10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir

10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)

10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL

10295:f7a18a1e9610
6870564 panic in zfs_getsecattr

Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks


211931 28-Aug-2010 mm

Update ZFS metaslab code from OpenSolaris.
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.

Detailed information (OpenSolaris onnv changesets and Bug IDs):

11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently

11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics

12047:7c1fcc8419ca
6917066 zfs block picking can be improved

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks


209962 13-Jul-2010 mm

Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced
in Solaris 10 updates 141445-09 and 142901-14.

Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)

7844:effed23820ae
6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)

7897:e520d8258820
6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)

7965:b795da521357
6740164 zpool attach can create an illegal root pool (141909-02)

8084:b811cc60d650
6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)

8121:7fd09d4ebd9c
6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)

8129:e4f45a0bfbb0
6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)

8188:fd00c0a81e80
6761100 want zdb option to select older uberblocks (141445-01)

8190:6eeea43ced42
6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)

8225:59a9961c2aeb
6737463 panic while trying to write out config file if root pool import fails (141445-01)

8227:f7d7be9b1f56
6765294 Refactor replay (141445-01)

8228:51e9ca9ee3a5
6572357 libzfs should do more to avoid mnttab lookups (141909-01)
6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)

8241:5a60f16123ba
6328632 zpool offline is a bit too conservative (141445-01)
6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698 checksum failures after offline -t / export / import / scrub (141445-01)
6745863 ZFS writes to disk after it has been offlined (141445-01)
6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107 I/O should never suspend during spa_load() (141445-01)
6776548 codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)

8242:e46e4b2f0a03
6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)

8269:03a7e9050cfd
6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194 "zfs get" VALUE column is as wide as NAME (141445-01)
6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518 ASSERT strings shouldn't be pre-processed (141445-01)

8274:846b39508aff
6713916 scrub/resilver needlessly decompress data (141445-01)

8343:655db2375fed
6739553 libzfs_status msgid table is out of sync (141445-01)
6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108 zfs_realloc() should not free original memory on failure (141445-01)

8525:e0e0e525d0f8
6788830 set large value to reservation cause core dump (141445-01)
6791064 want sysevents for ZFS scrub (141445-01)
6791066 need to be able to set cachefile on faulted pools (141445-01)
6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)

8547:bcc7b46e5ff7
6792884 Vista clients cannot access .zfs (141445-01)

8632:36ef517870a3
6798384 It can take a village to raise a zio (141445-01)

8636:7e4ce9158df3
6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)

8692:692d4668b40d
6801507 ZFS read aggregation should not mind the gap (141445-01)

8697:e62d2612c14d
6633095 creating a filesystem with many properties set is slow (141445-01)

8768:dfecfdbb27ed
6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)

8811:f8deccf701cf
6790687 libzfs mnttab caching ignores external changes (141445-01)
6791101 memory leak from libzfs_mnttab_init (141445-01)

8845:91af0d9c0790
6800942 smb_session_create() incorrectly stores IP addresses (N/A)
6582163 Access Control List (ACL) for shares (141445-01)
6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184 Panic at smb_oplock_conflict+0x35() (N/A)

8876:59d2e67b4b65
6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)

8924:5af812f84759
6789318 coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)

9030:243fd360d81f
6815893 hang mounting a dataset after booting into a new boot environment (141445-01)

9056:826e1858a846
6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)

9179:d8fbd96b79b3
6790064 zfs needs to determine uid and gid earlier in create process (141445-01)

9214:8d350e5d04aa
6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)

9229:e3f8b41e5db4
6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)

9230:e4561e3eb1ef
6821169 offlining a device results in checksum errors (141445-01)
6821170 ZFS should not increment error stats for unavailable devices (141445-01)
6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)

9234:bffdc4fc05c4
6792139 recovering from a suspended pool needs some work (141445-01)
6794830 reboot command hangs on a failed zfs pool (141445-01)

9246:67c03c93c071
6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)

9276:a8a7fc849933
6816124 System crash running zpool destroy on broken zpool (141445-03)

9355:09928982c591
6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)

9391:413d0661ef33
6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)

9396:f41cf682d0d3 (part already merged)
6501037 want user/group quotas on ZFS (141445-03)
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)

9404:319573cd93f8
6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)

9412:4aefd8704ce0
6717022 ZFS DMU needs zero-copy support (141445-03)

9425:e7ffacaec3a8
6799895 spa_add_spares() needs to be protected by config lock (141445-03)
6826466 want to post sysevents on hot spare activation (141445-03)
6826468 spa 'allowfaulted' needs some work (141445-03)
6826469 kernel support for storing vdev FRU information (141445-03)
6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471 I/O errors after device remove probe can confuse FMA (141445-03)
6826472 spares should enjoy some of the benefits of cache devices (141445-03)

9443:2a96d8478e95
6833711 gang leaders shouldn't have to be logical (141445-03)

9463:d0bd231c7518
6764124 want zdb to be able to checksum metadata blocks only (141445-03)

9465:8372081b8019
6830237 zfs panic in zfs_groupmember() (141445-03)

9466:1fdfd1fed9c4
6833162 phantom log device in zpool status (141445-03)

9469:4f68f041ddcd
6824968 add ZFS userquota support to rquotad (141445-03)

9470:6d827468d7b5
6834217 godfather I/O should reexecute (141445-03)

9480:fcff33da767f
6596237 Stop looking and start ganging (141909-02)

9493:9933d599bc93
6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)

9512:64cafcbcc337
6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)

9515:d3b739d9d043
6586537 async zio taskqs can block out userland commands (142901-09)

9554:787363635b6a
6836768 zfs_userspace() callback has no way to indicate failure (N/A)

9574:1eb6a6ab2c57
6838062 zfs panics when an error is encountered in space_map_load() (141909-02)

9583:b0696cd037cc
6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)

9630:e25a03f552e0
6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)

9653:a70048a304d1
6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)

9688:127be1845343
6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069 zfs get userused@S-1-... doesn't work (N/A)

9873:8ddc892eca6e
6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)

9904:d260bd3fd47c
6838344 kernel heap corruption detected on zil while stress testing (141445-06)

9951:a4895b3dd543
6844900 zfs_ioc_userspace_upgrade leaks (N/A)

10040:38b25aeeaf7a
6857012 zfs panics on zpool import (141445-06)

10000:241a51d8720c
6848242 zdb -e no longer works as expected (N/A)

10100:4a6965f6bef8
6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)

10160:a45b03783d44
6861983 zfs should use new name <-> SID interfaces (N/A)
6862984 userquota commands can hang (141445-06)

10299:80845694147f
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)

10302:a9e3d1987706
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)

10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)

10800:469478b180d9
6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)

10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)

10810:b6b161a6ae4a
6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)

10890:499786962772
6807339 spurious checksum errors when replacing a vdev (142901-13)

11249:6c30f7dfc97b
6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)

11454:6e69bacc1a5a
6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)

11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)

Discussed with: pjd
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 months


208373 21-May-2010 mm

Update L2ARC code and fix several bugs.

- improve ARC memory consumption (Bug ID 6488341)
- ARC/L2ARC metadata accounting (Bug ID 6748019)
- L2ARC turbo warmup (Bud ID 6748023)
- kstats for ARC content (Bug ID 6748023)
- kstats for evicted bytes from ARC by L2ARC state (Bud ID 6871680)
- fix panic on i386 systems (Bug ID 6821260)

OpenSolaris onnv revisions:
8582:df9361868dbe, 8628:97dcded6e556, 9215:7c4584f76b47,
9274:a10f8bd993c1, 10357:29060492b29d

OpenSolaris Bug IDs:
6748019, 6748023, 6748030, 6488341, 6798268, 6821260, 6790261, 6871680

Approved by: pjd, delphij (mentor)
Obtained from: OpenSlaris (multiple bug IDs)
MFC after: 3 days


208370 21-May-2010 mm

Fix: vdev_reopen() can lead to failed allocations

OpenSolaris onnv-revision: 7980:589f37f25048

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6764914)
MFC after: 3 days


208166 16-May-2010 pjd

Fix userland build by making io_task available only for the kernel and by
providing taskq_dispatch_safe() macro.

MFC after: 1 week


208147 16-May-2010 pjd

Add task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.

Prodded by: kib
MFC after: 1 week


208131 16-May-2010 mm

Fix deadlock between zfs_dirent_lock and zfs_rmdir

OpenSolaris onnv revision: 11321:506b7043a14c

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6847615)
MFC after: 3 days


208130 16-May-2010 mm

Fix perfomance problem with ZFS prefetch caching [1]
Add statistics for ZFS prefetch (sysctl kstat.zfs.misc.zfetchstats)

Partial import of OpenSolaris onnv revision 10474:0e96dd3b905a

Reported by: jhell@dataix.net (private e-mail) [1]
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6859997, 6868951)
MFC after: 3 days


208047 13-May-2010 mm

Import OpenSolaris revision 7837:001de5627df3
It includes the following changes:
- parallel reads in traversal code (Bug ID 6333409)
- faster traversal for zfs send (Bug ID 6418042)
- traversal code cleanup (Bug ID 6725675)
- fix for two scrub related bugs (Bug ID 6729696, 6730101)
- fix assertion in dbuf_verify (Bug ID 6752226)
- fix panic during zfs send with i/o errors (Bug ID 6577985)
- replace P2CROSS with P2BOUNDARY (Bug ID 6725680)

List of OpenSolaris Bug IDs:
6333409, 6418042, 6757112, 6725668, 6725675, 6725680,
6725698, 6729696, 6730101, 6752226, 6577985, 6755042

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 1 week


207670 05-May-2010 mm

Introduce hardforce export option (-F) for "zpool export".
When exporting with this flag, zpool.cache remains untouched.

OpenSolaris onnv revision: 8211:32722be6ad3b

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID: 6775357)


207626 04-May-2010 mm

Speed up ZFS list operation with objset prefetching.

Partial import of OpenSolaris onnv revisions:
8415:8809e849f63e, 10474:0e96dd3b905a

PR: kern/146297
Submitted by: myself
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6386929, 6755389, 6847118)
MFC after: 2 weeks


206797 18-Apr-2010 pjd

Restore previous order.


205231 16-Mar-2010 kmacy

- reduce contention by breaking up ARC state locks in to 16 for data
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)


201143 28-Dec-2009 delphij

Apply OpenSolaris revision 8012 which brings our zpool to version 14,
making it possible for zpools created on OpenSolaris 2009.06 be used
on FreeBSD.

PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks


200726 19-Dec-2009 delphij

Apply fix for Solaris bug 6801979: zfs recv can fail with E2BIG
(onnv revision 8986)

Requested by: mm
Submitted by: pjd
Obtained from: OpenSolaris
MFC after: 2 weeks


200724 19-Dec-2009 delphij

Apply fix Solaris bug 6462803 zfs snapshot -r failed because
filesystem was busy.

Submitted by: mm
Approved by: pjd
MFC after: 2 weeks


197497 25-Sep-2009 pjd

Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to
be a bit weak and OpenSolaris also switched to fletcher4.

PR: kern/139072
Reported by: Daniel Grund <bugs@dgrund.de>
MFC after: 3 days


197459 24-Sep-2009 pjd

Before calling vflush(FORCECLOSE) mark file system as unmounted so the
following vnops will fail. This is very important, because without this change
vnode could be reclaimed at any point, even if we increased usecount. The only
way to ensure that vnode won't be reclaimed was to lock it, which would be very
hard to do in ZFS without changing a lot of code. With this change simply
increasing usecount is enough to be sure vnode won't be reclaimed from under
us. To be precise it can still be reclaimed but we won't be able to see it,
because every try to enter ZFS through VFS will result in EIO.

The only function that cannot return EIO, because it is needed for vflush() is
zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.

MFC after: 3 days


197153 13-Sep-2009 pjd

When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.

MFC after: 3 days


196703 31-Aug-2009 pjd

Backport the 'dirtying dbuf' panic fix from newer ZFS version.

Reported by: Thomas Backman <serenity@exscape.org>
MFC after: 1 week


196307 17-Aug-2009 pjd

Manage asynchronous vnode release just like Solaris.

Discussed with: kmacy
Approved by: re (kib)


196295 17-Aug-2009 pjd

Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by: re (kib)


195909 27-Jul-2009 pjd

We don't support ephemeral IDs in FreeBSD and without this fix ZFS can
panic when in zfs_fuid_create_cred() when userid is negative. It is
converted to unsigned value which makes IS_EPHEMERAL() macro to
incorrectly report that this is ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.

PR: kern/132337
Submitted by: Matthew West <freebsd@r.zeeb.org>
Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by: re (kib)
MFC after: 2 weeks


194043 11-Jun-2009 kmacy

pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous
differences with Opensolaris' ZFS

Sorry for the churn


193980 11-Jun-2009 kmacy

check against prefetch_enable


192800 26-May-2009 trasz

MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by: pjd


185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


179310 25-May-2008 pjd

Fix namespace collision after src/sys/sys/file.h:1.78.


178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


174049 28-Nov-2007 jb

* Check endianness the FreeBSD way.

* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
variable.


168676 12-Apr-2007 pjd

MFp4: Synchronize with vendor (mostly 'zfs rename -r').


168566 10-Apr-2007 pjd

Try to stabilize ZFS with regard to memory consumption:
- Allow to shrink ARC down to 16MB (instead of 64MB).
- Set arc_max to 1/2 of kmem_map by default.
- Start freeing things earlier when low memory situation is detected.
- Serialize execution of arc_lowmem().

I decided to setup minimum ZFS memory requirements to 512MB of RAM and 256MB of
kmem_map size. If there is less RAM or kmem_map, a warning will be printed.
World is cruel, be no better. In other words: modern file system requires
modern hardware:)

From ZFS administration guide:

"Currently the minimum amount of memory recommended to install a Solaris
system is 512 Mbytes. However, for good ZFS performance, at least one
Gbyte or more of memory is recommended."


168498 08-Apr-2007 pjd

MFp4: Synchronize with recent OpenSolaris changes.


168404 06-Apr-2007 pjd

Please welcome ZFS - The last word in file systems.

ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.

I'd like to thank all SUN developers that created this great piece of software.

Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)