History log of /freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
Revision Date Author Comments
# 339324 12-Oct-2018 mav

MFC r339197: Add sysctls for dbuf metadata cache variables added in r336959.


# 339140 03-Oct-2018 mav

MFC r337211: MFV r337210: 9577 remove zfs_dbuf_evict_key tsd

The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


# 339114 03-Oct-2018 mav

MFC r337025: MFV r337022:
9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.

illumos/illumos-gate@fa98e487a9619b7902f218663be219e787a57dad

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 339109 03-Oct-2018 mav

MFC r336959: MFV r336958: 9337 zfs get all is slow due to uncached metadata

This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>


# 339034 01-Oct-2018 sef

MFC r334844, r336180, r336458

r334844

This originated from ZFS On Linux, as
https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

r336180

Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges. One of the missing changes
included a notification on scrub cancellation.

r336458

Fix a couple of typos in r334844 noticed by Richard Kojedzinszky

Approved by: mav
Sponsored by: iXsystems, Inc


# 332785 19-Apr-2018 mav

MFC r332523: 9433 Fix ARC hit rate

When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>


# 332552 16-Apr-2018 mav

MFC r331711: MFV 331710:
9188 increase size of dbuf cache to reduce indirect block decompression

illumos/illumos-gate@268bbb2a2fa79c36d4695d13a595ba50a7754b76

With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)

In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>


# 332540 16-Apr-2018 mav

MFC r331404: MFV r331400:
8484 Implement aggregate sum and use for arc counters

In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>


# 332525 16-Apr-2018 mav

MFC r329732: MFV r329502: 7614 zfs device evacuation/removal

illumos/illumos-gate@5cabbc6b49070407fb9610cfe73d4c0e0dea3e77

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>


# 331399 22-Mar-2018 mav

MFC r329694: MFV r324198: 8081 Compiler warnings in zdb

illumos/illumos-gate@3f7978d02b206a6ebc5652c91aa9f42da6fbe00c
https://github.com/illumos/illumos-gate/commit/3f7978d02b206a6ebc5652c91aa9f42da6fbe00c

https://www.illumos.org/issues/8081
zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
which uses -WError, the only way to build it is to disable all compiler
warnings. This makes it much harder to detect newly introduced bugs. We should
cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>


# 330824 13-Mar-2018 mav

MFC r330048: Add sysctls/tunables for dbuf cache size.


# 325931 17-Nov-2017 avg

MFC r325610: MFV r325609: 7531 Assign correct flags to prefetched buffers


# 323750 19-Sep-2017 avg

MFC r322237: MFV r322236: 8126 ztest assertion failed in dbuf_dirty due to dn_nlevels changing

illumos/illumos-gate@dcb6872c565819ac88acbc2ece999ef241c8b982
https://github.com/illumos/illumos-gate/commit/dcb6872c565819ac88acbc2ece999ef241c8b982

https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321610 27-Jul-2017 mav

MFC r320156, r320185, r320186, r320262, r320452, r321111:
MFV r318946: 8021 ARC buf data scatter-ization

illumos/illumos-gate@770499e185d15678ccb0be57ebc626ad18d93383
https://github.com/illumos/illumos-gate/commit/770499e185d15678ccb0be57ebc626ad1
8d93383

https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linea
r
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)

FreeBSD notes:
- the actual (default) chunk size is 4KB (despite the text above saying 1KB)
- we can try to reimplement ABDs, so that they are not permanently
mapped into the KVA unless explicitly requested, especially on
platforms with scarce KVA
- we can try to use unmapped I/O and avoid intermediate allocation of a
linear, virtual memory mapped buffer
- we can try to avoid extra data copying by referring to chunks / pages
in the original ABD

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>


# 321575 26-Jul-2017 mav

MFC r319750: MFV r319741: 8156 dbuf_evict_notify() does not need dbuf_evict_lock

illumos/illumos-gate@dbfd9f930004c390a2ce2cf850c71b4f880eef9c
https://github.com/illumos/illumos-gate/commit/dbfd9f930004c390a2ce2cf850c71b4f880eef9c

https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321573 26-Jul-2017 mav

MFC r319748: MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers

illumos/illumos-gate@adaec86ad212d9fd756bee322934fa54d1258605
https://github.com/illumos/illumos-gate/commit/adaec86ad212d9fd756bee322934fa54d1258605

https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321568 26-Jul-2017 mav

MFC r318935: MFV r318934: 8070 Add some ZFS comments

illumos/illumos-gate@40713f2b249d289022c715107b3951055a63aef0
https://github.com/illumos/illumos-gate/commit/40713f2b249d289022c715107b3951055a63aef0

https://www.illumos.org/issues/8070
Add some ZFS comments left by various developers at different times

Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>


# 321565 26-Jul-2017 mav

MFC r318928: MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf

illumos/illumos-gate@def4fac5882b4ca67bd0f4a53509b6d1fa8ae14e
https://github.com/illumos/illumos-gate/commit/def4fac5882b4ca67bd0f4a53509b6d1fa8ae14e

https://www.illumos.org/issues/8025
dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321554 26-Jul-2017 mav

MFC r318829: MFV r316920: 8023 Panic destroying a metaslab deferred range tree

illumos/illumos-gate@3991b535a8e990c0369be677746a87c259b13e9f
https://github.com/illumos/illumos-gate/commit/3991b535a8e990c0369be677746a87c259b13e9f

https://www.illumos.org/issues/8023
$C
ffffff0011bc0970 vpanic()
ffffff0011bc0a00 strlog()
ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00)
ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380)
ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800)
ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000)
ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0)
ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000)
ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000)
ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190)
ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>


# 321553 26-Jul-2017 mav

MFC r318828: MFV r316917: 7968 multi-threaded spa_sync()

illumos/illumos-gate@94c2d0eb22e9624151ee84a7edbf7178e1bf4087
https://github.com/illumos/illumos-gate/commit/94c2d0eb22e9624151ee84a7edbf7178e1bf4087

https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321552 26-Jul-2017 mav

MFC r318827: MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists

illumos/illumos-gate@10fbdecb05f411234920f8d3c92c148d39106d7e
https://github.com/illumos/illumos-gate/commit/10fbdecb05f411234920f8d3c92c148d39106d7e

https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321547 26-Jul-2017 mav

MFC r318821: MFV r316912: 7793 ztest fails assertion in dmu_tx_willuse_space

illumos/illumos-gate@61e255ce7267b52208af9daf434b77d37fb75622
https://github.com/illumos/illumos-gate/commit/61e255ce7267b52208af9daf434b77d37
fb75622

https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 321541 26-Jul-2017 mav

MFC r317541: MFV 316905

7740 fix for 6513 only works in hole punching case, not truncation

illumos/illumos-gate@7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553
https://github.com/illumos/illumos-gate/commit/7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553

https://www.illumos.org/issues/7740
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
To verify, create a large file, truncate it to a short length, and then
write beyond the end. Check with zdb to make sure that there are no
holes with birth time zero (will appear as gaps).

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>


# 321537 26-Jul-2017 mav

MFC r317511: MFV 316896

7580 ztest failure in dbuf_read_impl

illumos/illumos-gate@1a01181fdc809f40c64d5c6881ae3e4521a9d9c7
https://github.com/illumos/illumos-gate/commit/1a01181fdc809f40c64d5c6881ae3e4521a9d9c7

https://www.illumos.org/issues/7580
We need to prevent any reader whenever we're about the zero out all the
blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
dbuf_write_children_ready and free_children just prior to calling bzero.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>


# 321535 26-Jul-2017 mav

MFC r317414: MFV 316894

7252 7628 compressed zfs send / receive

illumos/illumos-gate@5602294fda888d923d57a78bafdaf48ae6223dea
https://github.com/illumos/illumos-gate/commit/5602294fda888d923d57a78bafdaf48ae6223dea

https://www.illumos.org/issues/7252
This feature includes code to allow a system with compressed ARC enabled to
send data in its compressed form straight out of the ARC, and receive data in
its compressed form directly into the ARC.

https://www.illumos.org/issues/7628
We should have longer, more readable versions of the ZFS send / recv options.

7628 create long versions of ZFS send / receive options

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>


# 321527 26-Jul-2017 mav

MFC r314267: MFV 314243

6676 Race between unique_insert() and unique_remove() causes ZFS fsid change

illumos/illumos-gate@40510e8eba18690b9a9843b26393725eeb0f1dac
https://github.com/illumos/illumos-gate/commit/40510e8eba18690b9a9843b26393725ee
b0f1dac

https://www.illumos.org/issues/6676

The fsid of zfs filesystems might change after reboot or remount. The problem
seems to
be caused by a race between unique_insert() and unique_remove(). The unique_re
move()
is called from dsl_dataset_evict() which is now an asynchronous thread. In a c
ase the
dsl_dataset_evict() thread is very slow and calls unique_remove() too late we
will end
up with changed fsid on zfs mount.

This problem is very likely caused by #5056.

Steps to Reproduce
Note: I'm able to reproduce this always on a single core (virtual) machine. On
multicore
machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

Impact
The persistent fsid (filesystem id) is essential for proper NFS functionality.
If the fsid of a filesystem changes on remount (or after reboot) the NFS
clients might not be able to automatically recover from such event and the
manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>


# 321524 26-Jul-2017 mav

MFC r313813: MFV 313786

7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid

illumos/illumos-gate@653af1b809998570c7e89fe7a0d3f90992bf0216
https://github.com/illumos/illumos-gate/commit/653af1b809998570c7e89fe7a0d3f90992bf0216

https://www.illumos.org/issues/7500
With the integration of:

commit 0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date: Sat Jul 26 13:40:04 2014 -0800
4873 zvol unmap calls can take a very long time for larger datasets

the dnode's dn_bufs field was changed from a list to a tree. As a result,
the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>


# 308082 29-Oct-2016 mav

MFC r306424: MFV r306422:
7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs

dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.

Closes #180

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>


# 307290 14-Oct-2016 mav

MFC r305340: MFC r305337:
7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object

Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489a169ab36f9ec1b2a111a60a5563a9f


# 307286 14-Oct-2016 mav

MFC r305338: MFV r305335: 7003 zap_lockdir() should tag hold

zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5a2c13e04328b39ecd2a6d0d3c4f7cb


# 307269 14-Oct-2016 mav

MFC r305325: MFV r303078:
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer

illumos/illumos-gate@926549256b71acd595f69b236779ff6b78fa08ef
https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef

https://www.illumos.org/issues/7086
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
db_blkptr, to prevent it from being changed by syncing context.
Otherwise we may see that ztest got a segfault from this stack:
libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
ztest_dmu_object_alloc_free+0x71(ea294588, 13)
ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
ztest_execute+0x89(a, 807c720, 13, 0)
ztest_thread+0xea(13, 0, 0, 0)
libc.so.1`_thrp_setup+0x88(f0983240)
libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
Looking into it a bit, we see that this is an embedded blockpointer, so
BP_GET_NDVAS should have returned 0:
b32b240::blkptr
EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
Instead, it looks like another thread is modifying this blockpointer:
b32b240::ugrep | ::whatis
f47a0e0c is in [ stack tid=0x19f ]
ebd6ec40 is in [ stack tid=0x226 ]
ea293bd0 is in [ stack tid=0x244 ]
ea293be4 is in [ stack tid=0x244 ]

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>


# 307265 14-Oct-2016 mav

MFC r305323: MFV r302991: 6950 ARC should cache compressed data

illumos/illumos-gate@dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2
https://github.com/illumos/illumos-gate/commit/dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2

https://www.illumos.org/issues/6950
When reading compressed data from disk, the ARC should keep the compressed
block cached and only decompress it when consumers access the block. The
uncompressed data should be short-lived allowing the ARC to cache a much larger
amount of data. The DMU would also maintain a smaller cache of uncompressed
blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>


# 304138 15-Aug-2016 avg

MFC r302838: 6513 partially filled holes lose birth time


# 304137 15-Aug-2016 avg

MFC r302837: 6844 dnode_next_offset can detect fictional holes