History log of /freebsd-11-stable/sys/cddl/contrib/opensolaris/uts/
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
364843 26-Aug-2020 jhb

MFC 361220: Correct the order of arguments to copyin() for Q_SETQUOTA.

363603 27-Jul-2020 markj

MFC r363373:
Fix a memory leak in dsl_scan_visitbp().

PR: 247445

363485 24-Jul-2020 asomers

MFC r362891:

Fix page fault in zfsctl_snapdir_getattr

Must acquire the z_teardown_lock before accessing the zfsvfs_t object. I
can't reproduce this panic on demand, but this looks like the correct
solution.

PR: 247668
Reviewed by: avg
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25543

363098 11-Jul-2020 allanjude

MFC r362396
ZFS: Allow setting checksum=skein on boot pools

PR: 245889
Reported by: delphij
Sponsored by: Klara Inc.
Event: July 2020 Bugathon

361088 15-May-2020 dim

Merge changes that enable DTrace-using ports to link correctly with lld
10, avoiding "unknown relocation 8" and other errors.

MFC r312658 (by markj):

Remove the DTRACEHIOC_ADD ioctl.

This ioctl has been considered legacy by upstream since the DTrace code
was first imported, and is unused. The removal also allows some
simplification of dtrace_helper_slurp().

Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a
bug, it would overwrite an in-memory copy of the DOF header rather than
the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the
helper back out automatically since its argument has the IOC_OUT attribute.

MFC r313262 (by markj):

Use PC-relative relocations for USDT probe sites on i386 and amd64.

When recording probe site addresses in the output DOF file, dtrace -G
needs to emit relocations for the .SUNW_dof section in order to obtain
the addresses of functions containing probe sites. DTrace expects the
addresses to be relative to the base address of the final ELF file,
and the amd64 USDT implementation was relying on some unspecified and
incorrect behaviour in the base system GNU ld to achieve this.

This change reimplements the probe site relocation handling to allow
USDT to be used with lld and newer GNU binutils. Specifically, it
makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the
probe site address relative to the DOF file address, and adds and uses a
new DOF relocation type which computes the final probe site address using
these relative offsets.

Reported by and discussed with: Rafael Esp?ndola
Differential Revision: https://reviews.freebsd.org/D9374

359722 08-Apr-2020 freqlabs

MFC r359303

MFOpenZFS: ZVOLs should not be allowed to have children

zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

Approved by: mav (mentor)
openzfs/zfs@d8d418ff0cc90776182534bce10b01e9487b63e4

359554 02-Apr-2020 mav

MFC r359112: MFOpenZFS: make zil max block size tunable

We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations. The large allocations are for ZIL blocks. If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more. In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
openzfs/zfs@b8738257c2607c73c731ce8e0fd73282b266d6ef

358836 10-Mar-2020 mav

MFC r358580: Increase number of write completion threads, matching ZoL.

Our iSCSI benchmarks on a large 80-core system show that previous limit
of 8 threads can be a bottleneck. At some points this change increases
write IOPS by as much as 50%. I am still not sure that so many threads
is really required, but we tested lower amounts and got no significant
benefits, while latencies were a bit worse, so decided to not diverge.

358608 04-Mar-2020 mav

MFC r358357: MFZoL: Relax restriction on zfs_ioc_next_obj() iteration

Per the documentation for dnode_next_offset in dnode.c, the "txg"
parameter specifies a lower bound on which transaction the dnode can
be found in. We are interested in all dnodes that are removed between
the first and last transaction in the snapshot. It doesn't need to be
created in that snapshot to correspond to a removed file.

In fact, the behavior of zfs diff in the test case exactly matches
this: the transaction that created the data that was deleted in snapshot
"2" was produced before, in snapshot "1", definitely predating the first
transaction in snapshot "2".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com>
Closes #2081
zfsonlinux/zfs@7290cd3c4ed19fb3f75b8133db2e36afcdd24beb

358606 04-Mar-2020 mav

MFC r358342: MFZoL: Fix resilver writes in vdev_indirect_io_start

This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
zfsonlinux/zfs@5aa95ba0d3502779695341b5f55fa5ba1d3330ff

358604 04-Mar-2020 mav

MFC r358339: MFZoL: Fix issue with scanning dedup blocks as scan ends

This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
zfsonlinux/zfs@5e0bd0ae056e26de36dee3c199c6fcff8f14ee15

358602 04-Mar-2020 mav

MFC r358337: MFZoL: Fix 2 small bugs with cached dsl_scan_phys_t

This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
zfsonlinux/zfs@8cb119e3dc0ac6c90b1517fbadc021b7e9741fc6

358600 04-Mar-2020 mav

MFC r358336, r358340: MFZoL: Fix txg_sync_thread hang in scan_exec_io()

When scn->scn_maxinflight_bytes has not been initialized it's
possible to hang on the condition variable in scan_exec_io().
This issue was uncovered by ztest and is only possible when
deduplication is enabled through the following call path.

txg_sync_thread()
spa_sync()
ddt_sync_table()
ddt_sync_entry()
dsl_scan_ddt_entry()
dsl_scan_scrub_cb()
dsl_scan_enqueuei()
scan_exec_io()
cv_wait()

Resolve the issue by always initializing scn_maxinflight_bytes
to a reasonable minimum value. This value will be recalculated
in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit
and the addition/removal of vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7098
zfsonlinux/zfs@f90a30ad1b32a971f62a540f8944e42f99b254ce

358313 25-Feb-2020 mav

MFC r349381: Avoid extra taskq_dispatch() calls by DMU.

DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

357706 09-Feb-2020 kevans

MFC O_SEARCH: r357412, r357461, r357580, r357584, r357636, r357671, r357688

r357412:
Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

r357461:
namei: preserve errors from fget_cap_locked

Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

r357580:
O_SEARCH test: drop O_SEARCH|O_RDWR local diff

In FreeBSD's O_SEARCH implementation, O_SEARCH in conjunction with O_RDWR or
O_WRONLY is explicitly rejected. In this case, O_RDWR was not necessary
anyways as the file will get created with or without it.

This was submitted upstream as misc/54940 and committed in rev 1.8 of the
file.

r357584:
Record-only MFV of r357583: netbsd-tests: import upstreamed changes

The changes in question originated in FreeBSD/head; no further action is
required.

r357636:
MFV r357635: imnport v1.9 of the O_SEARCH tests

The RCSID data was wrong, so this is effectively a record-only merge
with correction of said data. No further changes should be needed in this
area, as we've now upstreamed our local changes to this specific test.

r357671:
O_SEARCH test: mark revokex an expected fail on NFS

The revokex test does not work when the scratch directory is created on NFS.
Given the nature of NFS, it likely can never work without looking like a
security hole since O_SEARCH would rely on the server knowing that the
directory did have +x at the time of open and that it's OK for it to have
been revoked based on POSIX specification for O_SEARCH.

This does mean that O_SEARCH is only partially functional on NFS in general,
but I suspect the execute bit getting revoked in the process is likely not
common.

r357688:
MFV r357687: Import NFS fix for O_SEARCH tests

The version that ended upstream was ultimately slightly different than the
version committed here; notably, statvfs() is used but it's redefined
appropriately to statfs() on FreeBSD since we don't provide the fstypename
for the former interface.

357605 05-Feb-2020 kevans

MFC r357410-r357411: Avoid duplicating VEXEC checks in VOP_CACHEDLOOKUP

r357410: pseudofs: don't do VEXEC check in VOP_CACHEDLOOKUP

VOP_CACHEDLOOKUP should assume that the appropriate VEXEC check has been
done in the caller (vfs_cache_lookup), so it does not belong here.

r357411: zfs: light refactor to indicate cachedlookup in zfs_lookup

If we come from VOP_CACHEDLOOKUP, we must skip the VEXEC check as it will
have been done in the caller (vfs_cache_lookup). This is a part of D23247,
which may skip the earlier VEXEC check as well if the root fd was opened
with O_SEARCH.

This one required slightly more work as zfs_lookup may also be called
indirectly as VOP_LOOKUP or a couple of other places where we must do the
check.

355637 12-Dec-2019 mav

MFC r355182: Fix use-after-free in case of L2ARC prefetch failure.

In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device. Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read try to bump the ZIO priority, it will crash.

The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.

355443 06-Dec-2019 kib

MFC r355211:
Add a VN_OPEN_INVFS flag.

354642 12-Nov-2019 mav

MFC r354360: Add vfs.zfs.zio.taskq_batch_pct tunable.

354366 05-Nov-2019 mav

MFC r354159: FreeBSD'fy ZFS zlib zalloc/zfree callbacks.

The previous code came from OpenSolaris, which in my understanding require
allocation size to be known to free memory. To store that size previous
code allocated additional 8 byte header. But I have noticed that zlib
with present settings allocates 64KB context buffers for each call, that
could be efficiently cached by UMA, but addition of those 8 bytes makes
them fall back to physical RAM allocations, that cause huge overhead and
lock congestion on small blocks. Since FreeBSD's free() does not have
the size argument, switching to it solves the problem, increasing write
speed to ZVOLs with 4KB block size and GZIP compression on my 40-threads
test system from ~60MB/s to ~600MB/s.

354026 24-Oct-2019 avg

MFC r353614: MFV r353613: 10731 zfs: NULL pointer errors

353759 19-Oct-2019 avg

MFC r353037: ZFS: add bookmark renaming

353583 15-Oct-2019 mav

MFC r352939: Improve latency of synchronous 128KB writes.

Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it became
~124KB+/4KB+ in two 128KB log blocks to free space in the second block
for another record. Unfortunately in case of 128KB only writes, when space
in the second block remained unused, that change increased write latency by
imbalancing checksum computation time between parallel threads.

This change introduces new 68KB log block size, used for both writes below
67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB
block to not increase processing overhead. Writes above 131KB are still
using full 128KB blocks, since possible saving there is small. Mixed loads
will likely also fall back to previous 128KB, since code uses maximum of
the last 10 requested block sizes.

On a simple 128KB write test with queue depth of 1 this change demonstrates
~15-20% performance improvement.

353339 09-Oct-2019 avg

MFC r352591: MFZoL: Retire send space estimation via ZFS_IOC_SEND

Add a small wrapper around libzfs_core's lzc_send_space() to libzfs so
that every legacy ZFS_IOC_SEND consumer, along with their userland
counterpart estimate_ioctl(), can leverage ZFS_IOC_SEND_SPACE to
request send space estimation.

The legacy functionality in zfs_ioc_send() is left untouched for
compatibility purposes.

Obtained from: ZoL
Obtained from: zfsonlinux/zfs@cf7684bc8d57
Author: loli10K <ezomori.nozomu@gmail.com>

352724 25-Sep-2019 avg

MFC r352506: fix dsl_scan_ds_clone_swapped logic

PR: 239566

352687 25-Sep-2019 mav

MFC r352493: Fix typo, setting hidden flag instead of reparse.

352376 16-Sep-2019 avg

MFC r351803: ZFS: Always refuse receving non-resume stream when resume state exists

351933 06-Sep-2019 avg

MFC r351593: zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets

351807 04-Sep-2019 avg

MFC r351168: zfs_vget: fix vnode reference count leak in error path

349216 19-Jun-2019 avg

MFC r348772: Restore ARC MFU/MRU pressure

Submitted by: Slawa Olhovchenkov <slw@zxy.spb.ru>
Sponsored by: Integros [integros.com]

347049 03-May-2019 mav

MFC r346762: Add mutex_destroy() missed in r334844.

347047 03-May-2019 mav

MFC r346760: Fix minor mismerges.

No functional change.

346686 25-Apr-2019 mav

MFC r345200: MFV r336930: 9284 arc_reclaim_thread has 2 jobs

`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.

The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.

The fix is to use seperate threads to:

1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`

2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.

illumos/illumos-gate@de753e34f9c399037936e8bc547d823bba9d4b0d

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>

346684 25-Apr-2019 mav

MFC r340311: Do not ignore arc_adjust() return value.

This covers scenario when ARC may not shrink as fast as it could:
1. arc_size < arc_c and arc_adjust() does not evict anything, returning
zero to arc_reclaim_thread();
2. arc_available_memory() reports memory pressure, which can not be
satisfied by arc_kmem_reap_now();
3. arc_shrink() reduces arc_c and calls arc_adjust(), return of which is
ignored;
4. even if the last arc_adjust() could not satisfy arc_size < arc_c,
arc_reclaim_thread() will still go to sleep, since the first one
returned zero.

346681 25-Apr-2019 mav

MFC r339298 (by allanjude):
Add missing sysctls for tuning vdev queue depths for new I/O types

This connects new tunables that were added but not exposed in:
r329502 (zpool remove)
r337007 (zpool initialize)

346679 25-Apr-2019 mav

MFC r339009 (by allanjude):
Avoid panic when adjusting priority of a read in the face of an IO error

PR: 231516
Reported by: sbruno
Approved by: re (rgrimes)
Obtained from: ZFS-on-Linux
X-MFC-with: 334844
Sponsored by: Klara Systems

MFV/ZoL: Fix zio->io_priority failed (7 < 6) assert

commit c26cf0966d131b722c32f8ccecfe5791a789d975
Author: Tony Hutter <hutter2@llnl.gov>
Date: Tue May 29 18:13:48 2018 -0700

Fix zio->io_priority failed (7 < 6) assert

This fixes an assert in vdev_queue_change_io_priority():

VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>

346676 25-Apr-2019 mav

MFC r337594 (by mmacy):
ZFS/MFV: Use cached feature info in spa_add_feature_stats()

commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date: Thu Feb 26 12:24:11 2015 -0800

Use cached feature info in spa_add_feature_stats()

Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082

346131 11-Apr-2019 mav

MFC r344936: MFV/ZoL: Disable LBA weighting on files and SSDs

The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f0853946f8150481ca22602d1334dfe

To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.

346129 11-Apr-2019 mav

MFC r344934, r345014: Add separate aggregation limit for non-rotating media.

Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large. Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.

Sponsored by: iXsystems, Inc.

346127 11-Apr-2019 mav

MFC r344926:
MFV/ZoL: Fix zfs_vdev_aggregation_limit bounds checking

Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f53687ac6d1dee21f60de41b2552a43201

MFV/ZoL: Cap maximum aggregate IO size

Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259
Closes #6270
zfsonlinux/zfs@2d678f779aba26a93314c8ee1142c3985fa25cb6

346032 08-Apr-2019 sjg

Add _PC_ACL_* to vop_stdpathconf

This avoid EINVAL from tmpfs etc.

Merge of r345024

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D19512

345123 14-Mar-2019 mav

MFC r344903: Improve entropy for ZFS taskqueue selection.

I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function. In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.

345121 14-Mar-2019 mav

MFC r344866: Add respective tunables to few ZFS sysctls.

345069 12-Mar-2019 mav

MFC r339299: Pull in a follow-on commit to resolve a deadlock in ZFS
sequential resilver (r334844)

MFV/ZoL: Fix deadlock in IO pipeline

commit a76f3d0437e5e974f0f748f8735af3539443b388
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Fri Mar 16 16:46:06 2018 -0700

Fix deadlock in IO pipeline

In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock. This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed. This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

--- THREAD 1 ---
arc_read()
zio_nowait()
zio_vdev_io_start()
vdev_queue_io() <--- mutex_enter(vq->vq_lock)
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
zio_vdev_io_assess()
zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
zio_change_priority() <- mutex_enter(zio->zio_lock)
vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

344076 13-Feb-2019 mav

MFC r343586: Remove BIO_ORDERED flag from BIO_FLUSH sent by ZFS.

In all cases where ZFS sends BIO_FLUSH, it first waits for all related
writes to complete, so its BIO_FLUSH does not care about strict ordering.
Removal of one makes life much easier at least for NVMe driver, which
hardware has no concept of request ordering, relying completely on software.

344005 11-Feb-2019 mav

MFC r343745, r343752: Add missed tunables/sysctls for some new vdev variables.

While there, make few existing sysctls writeable, since there is no reason
not to.

343983 10-Feb-2019 oshogbo

MFC r343470:
zfs: allow to change cache flush sysctl

There is no reason for this variable to be tunable.
This variable is used as a barrier in few places.

Discussed with: pjd
MFC after: 2 weeks
Sponsored by: Fudo Security

343624 31-Jan-2019 sef

MFC r342928:
Change ZFS quotas to return EINVAL when not present (matches man page).

Approved by: mav
Reviewed by: markj
PR: 234413
Sponsored by: iXsystems Inc
Reported by: Emrion <kmachine@free.fr>

342943 11-Jan-2019 avg

MFC r342525: MFV r342469: 9630 add lzc_rename and lzc_destroy to libzfs_core

Relnotes: maybe

342941 11-Jan-2019 avg

MFC r342541: MFV r342532: 5882 Temporary pool names

Note that this commit brings only formatting changes that were done
during the final review of the illumos change, because FreeBSD got the
main changes before illumos.

341828 11-Dec-2018 allanjude

MFC: r339289: Resolve a hang in ZFS during vnode reclaimation

This is caused by a deadlock between zil_commit() and zfs_zget()
Add a way for zfs_zget() to break out of the retry loop in the common case

PR: 229614, 231117
Reported by: grembo, jhb, Andreas Sommer, others
Relnotes: yes
Sponsored by: Klara Systems

341074 27-Nov-2018 markj

MFC r340856:
Ensure that directory entry padding bytes are zeroed.

339441 19-Oct-2018 mav

MFC r339372: Skip VDEV_IO_DONE stage only for ZIO_TYPE_FREE.

Device removal code uses zio_vdev_child_io() with ZIO_TYPE_NULL parent,
that never happened before. It confused FreeBSD-specific TRIM code,
which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs. As
result of that stage being skipped device removal ZIOs leaked references
and memory that supposed to be freed by VDEV_IO_DONE, making it stuck.

It is a quick patch rather then a nice fix, but hopefully we'll be able
to drop it all together when alternative TRIM implementation finally get
landed.

PR: 228750, 229007

339440 19-Oct-2018 mav

MFC r339329: Add ZIO_TYPE_FREE support for indirect vdevs.

Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have the data
buffers, so don't need the pointer adjustment.

PR: 228750, 229007

339439 19-Oct-2018 mav

MFC r339335: Avoid zero-sized kmem_alloc() in vdev_compact_children().

The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.

This is a part of ZoL port of device removal patch:

commit a1d477c24c7badc89c60955995fd84d311938486
Author: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>

339345 13-Oct-2018 mav

MFC r339288: Remove extra thread_exit() call left after r329802.

spa_condense_indirect_thread() is no longer a thread function, but just
a callback for new zthr KPI.

339324 12-Oct-2018 mav

MFC r339197: Add sysctls for dbuf metadata cache variables added in r336959.

339306 11-Oct-2018 mav

MFC r339237: Fix r336951 mismerge -- use of uninitialized variable.

339158 03-Oct-2018 mav

MFC r337567 (by mmacy):
Performance optimization of AVL tree comparator functions

MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: Sat Aug 27 20:12:53 2016 +0200

perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old 6.85789 s
new 2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals

perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033

339153 03-Oct-2018 mav

MFC r338869: MFV r338866: 9700 ZFS resilvered mirror does not balance reads

illumos/illumos-gate@82f63c3c2bf5e4378706e8dcfccf717d67371be9

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>

339152 03-Oct-2018 mav

MFC r337972: 9751 Allocation throttling misplacing ditto blocks

Relax allocation throttling for ditto blocks. Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment. Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by: iXsystems, Inc.

339151 03-Oct-2018 mav

MFC r337970: 9738 Fix third block copy allocations, broken at 9112.

Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.

Reviewed by: Paul Dagnelie <pcd@delphix.com>

339150 03-Oct-2018 mav

MFC r337923: Make vfs.zfs.zio.dva_throttle_enabled sysctl writable.

Not sure what I thought originally, but as I see now runtime changes are
working fine, and the code seems like even designed for this.

339149 03-Oct-2018 mav

MFC r337883: Add couple tunables/sysctl, missed in r336949.

339148 03-Oct-2018 mav

MFC r337870: Fix mismerge in r337196.

ZoL did the same mistake, and fixed it with separate commit 863522b1f9:

dsl_scan_scrub_cb: don't double-account non-embedded blocks

We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.

This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:

https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720
Closes #7738

339147 03-Oct-2018 mav

MFC r337229: Reduce taskq and context-switch cost of zio pipe

When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching. We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736

zfsonlinux/zfs@62840030a7dceaee013ddbcc1eebcfc7922edf7c

339146 03-Oct-2018 mav

MFC r337227: MFV r337223:
9580 Add a hash-table on top of nvlist to speed-up operations

illumos/illumos-gate@2ec7644aab2a726a64681fa66c6db8731b160de1

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339142 03-Oct-2018 mav

MFC r337215: MFV 337214:
9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c868afb740396041be80ed2b141065e76

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Josh Paetzel <josh@tcbug.org>

339141 03-Oct-2018 mav

MFC r337213: MFV r337212:
9465 ARC check for 'anon_size > arc_c/2' can stall the system

illumos/illumos-gate@abe1fd01ce5a83718c5a840daeab4abdaec1c104

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Don Brady <don.brady@delphix.com>

339140 03-Oct-2018 mav

MFC r337211: MFV r337210: 9577 remove zfs_dbuf_evict_key tsd

The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339139 03-Oct-2018 mav

MFC r337209:
MFV r337208: 9591 ms_shift can be incorrectly changed in MOS config for
indirect vdevs that have been historically expanded

illumos/illumos-gate@11f6a9680e013a7c9c57dc0b64d3e91e2eee1a6b

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339138 03-Oct-2018 mav

MFC r337207: MFV r337206: 9338 moved dnode has incorrect dn_next_type

illumos/illumos-gate@c7fbe46df966ea665df63b6e6071808987e839d1

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339137 03-Oct-2018 mav

MFC r337205:
MFV r337204: 9439 ZFS double-free due to failure to dirty indirect block

illumos/illumos-gate@99a19144e82244f3426f055cc73af8a937c0135c

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339136 03-Oct-2018 mav

MFC r337202: MFV r337200:
9438 Holes can lose birth time info if a block has a mix of birth times

Ultimately, the problem here is that when you truncate and write a file in
the same transaction group, the dbuf for the indirect block will be zeroed
out to deal with the truncation, and then written for the write. During
this process, we will lose hole birth time information for any holes in the
range. In the case where a dnode is being freed, we need to determine
whether the block should be converted to a higher-level hole in the zio
pipeline, and if so do it when the dnode is being synced out.

illumos/illumos-gate@738e2a3ce3b2579222d6855e7fe75b5bcfcddf8d

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

339135 03-Oct-2018 mav

MFC r337201: Fix build after r337196 mismerge.

339134 03-Oct-2018 mav

MFC r337198: MFV r337197: 9456 ztest failure in zil_commit_waiter_timeout

illumos/illumos-gate@b6031810da58df96413bf76e068638fcab1f228a

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Prakash Surya <prakash.surya@delphix.com>

339133 03-Oct-2018 mav

MFC r337196: MFV r337195: 9454 ::zfs_blkstats should count embedded blocks

illumos/illumos-gate@dec267e7ea9828898b1c64462daa6636c4ef5e29

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339132 03-Oct-2018 mav

MFC r337194: MFV r337193:
9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'f
unction' for key 'error')"

illumos/illumos-gate@fe3ba4d1227d8746116ece7240682b13595c3142

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339131 03-Oct-2018 mav

MFC r337191:
MFV r337190: 9486 reduce memory used by device removal on fragmented pools

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

illumos/illumos-gate@cfd63e1b1bcf7ba4bf72f55ddbd87ce008d2986d

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339129 03-Oct-2018 mav

MFC r337183:
MFV r337182: 9330 stack overflow when creating a deeply nested dataset

Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.

illumos/illumos-gate@5ac95da7d61660aa299c287a39277cb0372be959

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>

339128 03-Oct-2018 mav

MFC r337181: 9539 Make zvol operations use _by_dnode routines

Continues what was started in 7801 add more by-dnode routines by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.

illumos/illumos-gate@8dfe5547fbf0979fc1065a8b6fddc1e940a7cf4f

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Richard Yao <richard.yao@prophetstor.com>

339126 03-Oct-2018 mav

MFC r337177:
MFV r337175: 9487 Free objects when receiving full stream as clone

All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.

zfsonlinux/zfs@48fbb9ddbf2281911560dfbc2821aa8b74127315
illumos/illumos-gate@7864b8192b8d30471fa2240466d516292e5765b8

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

339125 03-Oct-2018 mav

MFC r337172, MFV r337171:
9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes

Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.

illumos/illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339120 03-Oct-2018 mav

MFC r337169: MFV r337167: 9442 decrease indirect block size of spacemaps

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

illumos/illumos-gate@221813c13b43ef48330b03725e00edee85108cf1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339116 03-Oct-2018 mav

MFC r337030: MFV r337029:
9426 metaslab size can exceed offset addressable by spacemap

metaslab size can exceed offset addressable by spacemap. The vdev can
address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to
2^47 * 2^vdev_ashift. Therefore we may need to increase the number of
metaslabs so that the maximum metaslab size is capped at the amount that
can be addressed by the spacemap. This should happen in
vdev_metaslab_set_size().

illumos/illumos-gate@b4bf0cf0458759c67920a031021a9d96cd683cfe

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Don Brady <don.brady@delphix.com>

339115 03-Oct-2018 mav

MFC r337028: MFV r337027:
9328 zap code can take advantage of c99
9329 panic in zap_leaf_lookup() due to concurrent zapification

illumos/illumos-gate@bf26014c5541b6119f34e0d95294b7f2eb105ac2

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339114 03-Oct-2018 mav

MFC r337025: MFV r337022:
9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.

illumos/illumos-gate@fa98e487a9619b7902f218663be219e787a57dad

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339113 03-Oct-2018 mav

MFC r337021: MFV r337020:9443 panic when scrub a v10 pool

illumos/illumos-gate@bb1f424574ac8e08069d0ba993c2a41ffe796794

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

339112 03-Oct-2018 mav

MFC r337017: MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb425396adaebd0267d29e1026fc4dc413

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

339111 03-Oct-2018 mav

MFC r337007: MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>

339110 03-Oct-2018 mav

MFC r336961:
MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets

illumos/illummos-gate@df477c0afa111b5205c872dab36dbfde391656de

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>

339109 03-Oct-2018 mav

MFC r336959: MFV r336958: 9337 zfs get all is slow due to uncached metadata

This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339108 03-Oct-2018 mav

MFC r336956: MFV r336955: 9236 nuke spa_dbgmsg

We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339107 03-Oct-2018 mav

MFC r336954:
MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync

Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

illumos/illumos-gate@a3b5583021b7b45676bf1f0cc68adf7a97900b56

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339106 03-Oct-2018 mav

MFC r336951: MFV r336950: 9290 device removal reduces redundancy of mirrors

Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

339105 03-Oct-2018 mav

MFC r336949:
MFV r336948: 9112 Improve allocation performance on high-end systems

On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.

illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>

339104 03-Oct-2018 mav

MFC r336947: MFV r336946: 9238 ZFS Spacemap Encoding V2

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

339102 03-Oct-2018 mav

MFC r336943:
MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails

illumos/illumos-gate@b6bf6e1540f30bd97b8d6e2c21d95e17841e0f23

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

339034 01-Oct-2018 sef

MFC r334844, r336180, r336458

r334844

This originated from ZFS On Linux, as
https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

r336180

Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges. One of the missing changes
included a notification on scrub cancellation.

r336458

Fix a couple of typos in r334844 noticed by Richard Kojedzinszky

Approved by: mav
Sponsored by: iXsystems, Inc

339008 29-Sep-2018 sef

MFC r336017,r338799

r336017
This exposes ZFS user and group quotas via the normal
quatactl(2) mechanism. (Read-only at this point, however.)
In particular, this is to allow rpc.rquotad query quotas
for NFS mounts, allowing users to see their quotas on the
hosts using the datasets.

The changes specifically:

* Add new RPC entry points for querying quotas.
* Changes the library routines to allow non-UFS quotas.
* Changes rquotad to check for quotas on mounted filesystems,
rather than being limited to entries in /etc/fstab
* Lastly, adds a VFS entry-point for ZFS to query quotas.

Note that this makes one unavoidable behavioural change: if quotas
are enabled, then they can be queried, as opposed to the current
method of checking for quotas being specified in fstab. (With
ZFS, if there are user or group quotas, they're used, always.)

r338799
Author: kib

Fix ZFS VFS op quotactl to follow busy protocol.

Approved by: mav
Sponsored by: iXsystems, inc

338975 27-Sep-2018 mav

MFC r334810 (by benno), r338205, r338206:
r334810:
Break recursion involving getnewvnode and zfs_rmnode.

When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.

In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.

r338205:
Create separate taskqueue to call zfs_unlinked_drain().

r334810 introduced zfs_unlinked_drain() dispatch to taskqueue on every
deletion of a file with extended attributes. Using system_taskq for that
with its multiple threads in case of multiple files deletion caused all
available CPU threads to uselessly spin on busy locks, completely blocking
the system.

Use of single dedicated taskqueue is the only easy solution I've found,
while in would be great if we could specify that some task should be
executed only once at a time, but never in parallel, while many tasks
could use different threads same time.

r338206:
Add dmu_tx_assign() error handling in zfs_unlinked_drain().

The error handling got lost during r334810, while according to the report
error there may happen in case of dataset being over quota. In such case
just leave the node in the unlinked list to be freed sometimes later.

338904 24-Sep-2018 markj

MFC r338724:
Fix an nvpair leak in vdev_geom_read_config().

PR: 230704

338484 05-Sep-2018 kib

MFC r338370:
Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.

338456 04-Sep-2018 markj

MFC r338416:
Re-compute the ARC size before computing the MFU target.

338425 01-Sep-2018 markj

MFC r338365:
Add a sysctl for the ZFS abd_scatter_enabled setting.

338403 31-Aug-2018 mav

MFV r338288: Unblock speculative prefetcher also on pool creation.

Fix at r331950 appeared to be incomplete, fixing only case of pool
import, but not pool creation, leaving prefetcher still blocked for
newly created pools.

338346 28-Aug-2018 markj

MFC r338142:
Set arc_kmem_cache_reap_retry_ms to 0 and make it configurable.

335545 22-Jun-2018 avg

MFC r333630: Fix 'zpool create -t <tempname>'

Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.

333252 04-May-2018 emaste

MFC r333234: zfs_ioctl: avoid out-of-bound read

admbugs: 796
Submitted by: Domagoj Stolfa <ds815@cam.ac.uk>
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Approved by: re (early MFC as an EN candidate)

333194 03-May-2018 avg

MFC r332426: allow ZFS pool to have temporary name for duration of current import

The change adds -t <name> option to zpool create and -t option to zpool
import in its form with an old name and a new name. This allows to
import (or create) a pool under a name that's different from its real,
permanent name without affecting that name. This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.

Sponsored by: Panzura (porting)

332939 24-Apr-2018 markj

MFC r332364:
Assert that dtrace_probe() doesn't re-enter itself.

332785 19-Apr-2018 mav

MFC r332523: 9433 Fix ARC hit rate

When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

332755 19-Apr-2018 avg

MFC r331666: ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count

It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.

Sponsored by: Panzura

332554 16-Apr-2018 mav

MFC r331950: 9434 Speculative prefetch is blocked by device removal code.

Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal. At least one visual consequence
of it is completely blocked speculative prefetcher. This patch sets
the variable in such situations.

332553 16-Apr-2018 mav

MFC r331713: MFV r331712:
9280 Assertion failure while running removal_with_ganging test with 4K devices

illumos/illumos-gate@243952c7eeef020886e3e2e3df99a513df40584a

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>

332552 16-Apr-2018 mav

MFC r331711: MFV 331710:
9188 increase size of dbuf cache to reduce indirect block decompression

illumos/illumos-gate@268bbb2a2fa79c36d4695d13a595ba50a7754b76

With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)

In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

332551 16-Apr-2018 mav

MFC r331709: MFV r331708:
9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value

illumos/illumos-gate@9be12bd737714550277bd02b0c693db560976990

arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.

Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>

332550 16-Apr-2018 mav

MFC r331707: MFV r331706:
9235 rename zpool_rewind_policy_t to zpool_load_policy_t

illumos/illumos-gate@5dafeea3ebd2dd77affc802bcb90f63faf01589f

We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332549 16-Apr-2018 mav

MFC r331705: MFV 331704:
9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices

illumos/illumos-gate@ccef24b493bcbd146fcd6d8946666cae081470b6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332548 16-Apr-2018 mav

MFC r331703: MFV 331702:
9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate

illumos/illumos-gate@d1de72cfa29ab77ff80e2bb0e668a6afa5bccaf0

ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.

Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332547 16-Apr-2018 mav

MFC r331701: MFV r331695, 331700: 9166 zfs storage pool checkpoint

illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb_il.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool-features.7
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zpool_prop.c
common/Makefile.files
common/fs/zfs/dmu_traverse.c
common/fs/zfs/dnode.c
common/fs/zfs/dnode_sync.c
common/fs/zfs/dsl_dataset.c
common/fs/zfs/dsl_destroy.c
common/fs/zfs/dsl_dir.c
common/fs/zfs/dsl_pool.c
common/fs/zfs/dsl_scan.c
common/fs/zfs/dsl_synctask.c
common/fs/zfs/dsl_userhold.c
common/fs/zfs/metaslab.c
common/fs/zfs/range_tree.c
common/fs/zfs/spa.c
common/fs/zfs/spa_checkpoint.c
common/fs/zfs/spa_misc.c
common/fs/zfs/space_map.c
common/fs/zfs/sys/dmu.h
common/fs/zfs/sys/dsl_dir.h
common/fs/zfs/sys/dsl_pool.h
common/fs/zfs/sys/dsl_synctask.h
common/fs/zfs/sys/metaslab.h
common/fs/zfs/sys/metaslab_impl.h
common/fs/zfs/sys/range_tree.h
common/fs/zfs/sys/spa.h
common/fs/zfs/sys/spa_checkpoint.h
common/fs/zfs/sys/spa_impl.h
common/fs/zfs/sys/space_map.h
common/fs/zfs/sys/uberblock_impl.h
common/fs/zfs/sys/vdev.h
common/fs/zfs/sys/vdev_impl.h
common/fs/zfs/sys/vdev_removal.h
common/fs/zfs/sys/zio.h
common/fs/zfs/sys/zthr.h
common/fs/zfs/uberblock.c
common/fs/zfs/vdev.c
common/fs/zfs/vdev_indirect.c
common/fs/zfs/vdev_label.c
common/fs/zfs/vdev_removal.c
common/fs/zfs/zcp.c
common/fs/zfs/zcp_synctask.c
common/fs/zfs/zfs_ioctl.c
common/fs/zfs/zil.c
common/fs/zfs/zio.c
common/fs/zfs/zthr.c
common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
332544 16-Apr-2018 mav

MFC r331420 (by avg): zfs: fix mismatch between format specifier and type

vdev_dbgmsg_print_tree printed vdev_id of uint64_t type with %u format
specifier. That caused subsequent parameters to be incorrectly read
from the stack and lead to a crash when a wrong value was interpreted as
a string pointer.

This should be upstreamed.

332543 16-Apr-2018 mav

MFC r331414: Reduce struct aggsum_bucket padding to fit into one cache line.

332542 16-Apr-2018 mav

MFC r331408: MFV r331407: 9213 zfs: sytem typo

illumos/illumos-gate@edc8ef7d921c96b23969898aeb766cb24960bda7

Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Toomas Soome <tsoome@me.com>

332541 16-Apr-2018 mav

MFC r331406: MFV r331405: 9084 spa_*_ashift must ignore spare devices

illumos/illumos-gate@b037f3dbd69cef4a7ffd576ad33e07bfaf0b1e84

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

332540 16-Apr-2018 mav

MFC r331404: MFV r331400:
8484 Implement aggregate sum and use for arc counters

In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

332538 16-Apr-2018 mav

MFC r329805: MFV r329803:
9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e66b9fc1395ff2401aa2952f7c44ae34

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls
vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that
was already acquired from the first call to vdev_indirect_remap(). The specific scenario,
is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist
is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>

332537 16-Apr-2018 mav

MFC r329802: MFV r329799, r329800:
9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b4f491d5e839644e0912cad1c9e7122

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

332536 16-Apr-2018 mav

MFC r329798: MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a2c5e23f4b970ea101137eadd1470a1

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332535 16-Apr-2018 mav

MFC r329783: 8942 zfs promote .../%recv should be an error

illumos/illumos-gate@add927f8c8d101e16c23eb9cd270be4fd7edf7d5

Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

332534 16-Apr-2018 mav

MFC r329777: MFV r329776:
8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa)

illumos/illumos-gate@f4c1745bd6c9829a05ecec15759ede7757100ab5

Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think
this is the intended behaviour, we shouldn't be allowed to clear readonly
pools. Probably.

A fix is already in the ZFS on Linux repository to addess this issue:
https://github.com/zfsonlinux/zfs/commit/92e43c17188d47f47b69318e4884096dec380e36

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

332533 16-Apr-2018 mav

MFC r329775: MFV r329774:
8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly

illumos/illumos-gate@85723e5eec42f46dbfdb4c09b9e1ed66501d1ccf

When iterating over the input nvlist in dsl_props_set_sync_impl() when we
don't preserve the nvpair name before looking up ZPROP_VALUE, so when we
later go to process it nvpair_name() is always "value" instead of the actual
property name.

This results in a couple of bugs in the recv code:

- received properties are not restored correctly when failing to receive
an incremental send stream
- received properties are not completely replaced by the new ones when
successfully receiving an incremental send stream

This was discovered on ZFS on Linux (fixed in
https://github.com/zfsonlinux/zfs/commit/5f1346c29997dd4e02acf4c19c875d5484f33b1e)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

332532 16-Apr-2018 mav

MFC r329771: MFV r329770: 9035 zfs: this statement may fall through

illumos/illumos-gate@46ac8fdfc5a1f9d8240c79a6ae5b2889cbe83553

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>

332531 16-Apr-2018 mav

MFC r329769: MFV r329766: 8962 zdb should work on non-idle pools

illumos/illumos-gate@e144c4e6c90e7d4dccaad6db660ee42b6e7ba04f

Currently `zdb` consistently fails to examine non-idle pools as it fails
during the `spa_load()` process. The main problem seems to be that
`spa_load_verify()` fails as can be seen below:

$ sudo zdb -d -G dcenter
zdb: can't open 'dcenter': I/O error

ZFS_DBGMSG(zdb):
spa_open_common: opening dcenter
spa_load(dcenter): LOADING
disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
spa_load(dcenter): using uberblock with txg=40824950
spa_load(dcenter): UNLOADING
spa_load(dcenter): RELOADING
spa_load(dcenter): LOADING
disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
spa_load(dcenter): using uberblock with txg=40824952
spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is
done by creating a global flag in zfs and then setting it in `zdb`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332530 16-Apr-2018 mav

MFC r329765: MFV r329762: 8961 SPA load/import should tell us why it failed

illumos/illumos-gate@3ee8c80c747c4aa3f83351a6920f30c411236e1b

When we fail to open or import a storage pool, we typically don't get any
additional diagnostic information, just "no pool found" or "can not import".

While there may be no additional user-consumable information, we should at
least make this situation easier to debug/diagnose for developers and support.
For example, we could start by using `zfs_dbgmsg()` to log each thing that we
try when importing, and which things failed. E.g. "tried uberblock of txg X
from label Y of device Z". Also, we could log each of the stages that we go
through in `spa_load_impl()`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332529 16-Apr-2018 mav

MFC r329761: MFV r329760: 7638 Refactor spa_load_impl into several functions

illumos/illumos-gate@1fd3785ff6601d3e391378c2dcbf4c5f27e1fe32

spa_load_impl has grown out of proportions. It is currently over 700
lines long and makes it very hard to follow or debug the import process
even for experienced ZFS developers. The objective is to split it up
in a series of well commented functions.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

332528 16-Apr-2018 mav

MFC r329759:
9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon()

illumos/illumos-gate@36a64e62848b51ac5a9a5216e894ec723cfef14e

To prevent kmem_cache reaping from blocking other system resources, turn
kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers
to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which
exploits #9017's new taskq_empty().

Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Author: Tim Kordas <tim.kordas@joyent.com>

FreeBSD does not use taskqueue for kmem caches reaping, so this change
is less dramatic then it is on Illumos, just limiting reaping to 1 time
per second. It may possibly be improved later, if needed.

332526 16-Apr-2018 mav

MFC r329755: MFV r329753:
8809 libzpool should leverage work done in libfakekernel

illumos/illumos-gate@f06dce2c1f0f3af78581e7574f65bfba843ddb6e

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>

332525 16-Apr-2018 mav

MFC r329732: MFV r329502: 7614 zfs device evacuation/removal

illumos/illumos-gate@5cabbc6b49070407fb9610cfe73d4c0e0dea3e77

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfeature_common.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_deleg.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_deleg.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
common/Makefile.files
common/fs/zfs/arc.c
common/fs/zfs/bpobj.c
common/fs/zfs/dbuf.c
common/fs/zfs/ddt.c
common/fs/zfs/dmu.c
common/fs/zfs/dmu_objset.c
common/fs/zfs/dmu_tx.c
common/fs/zfs/dmu_zfetch.c
common/fs/zfs/dnode.c
common/fs/zfs/dsl_dataset.c
common/fs/zfs/dsl_deadlist.c
common/fs/zfs/dsl_destroy.c
common/fs/zfs/dsl_dir.c
common/fs/zfs/dsl_pool.c
common/fs/zfs/dsl_scan.c
common/fs/zfs/metaslab.c
common/fs/zfs/range_tree.c
common/fs/zfs/spa.c
common/fs/zfs/spa_config.c
common/fs/zfs/spa_misc.c
common/fs/zfs/space_map.c
common/fs/zfs/space_reftree.c
common/fs/zfs/sys/bpobj.h
common/fs/zfs/sys/dbuf.h
common/fs/zfs/sys/dmu.h
common/fs/zfs/sys/dnode.h
common/fs/zfs/sys/dsl_dataset.h
common/fs/zfs/sys/dsl_deadlist.h
common/fs/zfs/sys/dsl_deleg.h
common/fs/zfs/sys/dsl_dir.h
common/fs/zfs/sys/dsl_pool.h
common/fs/zfs/sys/dsl_scan.h
common/fs/zfs/sys/metaslab.h
common/fs/zfs/sys/metaslab_impl.h
common/fs/zfs/sys/range_tree.h
common/fs/zfs/sys/spa.h
common/fs/zfs/sys/spa_impl.h
common/fs/zfs/sys/space_map.h
common/fs/zfs/sys/vdev.h
common/fs/zfs/sys/vdev_impl.h
common/fs/zfs/sys/vdev_indirect_births.h
common/fs/zfs/sys/vdev_indirect_mapping.h
common/fs/zfs/sys/vdev_removal.h
common/fs/zfs/sys/zfs_debug.h
common/fs/zfs/sys/zil.h
common/fs/zfs/sys/zio.h
common/fs/zfs/sys/zio_priority.h
common/fs/zfs/txg.c
common/fs/zfs/vdev.c
common/fs/zfs/vdev_disk.c
common/fs/zfs/vdev_file.c
common/fs/zfs/vdev_geom.c
common/fs/zfs/vdev_indirect.c
common/fs/zfs/vdev_indirect_births.c
common/fs/zfs/vdev_indirect_mapping.c
common/fs/zfs/vdev_label.c
common/fs/zfs/vdev_mirror.c
common/fs/zfs/vdev_missing.c
common/fs/zfs/vdev_queue.c
common/fs/zfs/vdev_raidz.c
common/fs/zfs/vdev_removal.c
common/fs/zfs/vdev_root.c
common/fs/zfs/zcp_get.c
common/fs/zfs/zfs_ioctl.c
common/fs/zfs/zfs_vfsops.c
common/fs/zfs/zfs_vnops.c
common/fs/zfs/zil.c
common/fs/zfs/zio.c
common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
332524 16-Apr-2018 mav

MFC r307317: MFV r307313:
5120 zfs should allow large block/gzip/raidz boot pool (loader project)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

openzfs/openzfs@c8811bd3e2427dddbac6c05a59cfe117d8fea370

FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.

331848 31-Mar-2018 smh

MFC r330950:

Prevent ZFS TRIM breaking VTOC8 partitions

Sponsored by: Multiplay

331721 29-Mar-2018 mav

MFC r329738: MFV r329736: 8969 Cannot boot from RAIDZ with parity > 1

illumos/illumos-gate@0fb055e81fd0cda5221da8ddd98b2f8d1fc6bdbe

At present it is possible to boot from a root pool that is on RAIDZ but not
one that is on RAIDZ2 or RAIDZ3. This is because, at the time the pool
version is checked to ensure support for dual/triple parity, the uberblock
has not yet been loaded into the SPA and therefore the code determines that
the pool version is too old and returns ENOTSUP.

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andy Fiddaman <omnios@citrus-it.co.uk>

FreeBSD already had this fixed, so this is just a diff reduction.

331611 27-Mar-2018 avg

MFC r330974: MFV r330973: 9164 assert: newds == os->os_dsl_dataset

PR: 225877

331553 26-Mar-2018 markj

MFC r331134:
Fix an access of an uninitialized variable in dtrace_probe().

331399 22-Mar-2018 mav

MFC r329694: MFV r324198: 8081 Compiler warnings in zdb

illumos/illumos-gate@3f7978d02b206a6ebc5652c91aa9f42da6fbe00c
https://github.com/illumos/illumos-gate/commit/3f7978d02b206a6ebc5652c91aa9f42da6fbe00c

https://www.illumos.org/issues/8081
zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
which uses -WError, the only way to build it is to disable all compiler
warnings. This makes it much harder to detect newly introduced bugs. We should
cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>

331397 23-Mar-2018 mav

MFC r329690: MFV r319737: 6939 add sysevents to zfs core for commands

illumos/illumos-gate@ce1577b04976f1d8bb5f235b6eaaab15b46a3068
https://github.com/illumos/illumos-gate/commit/ce1577b04976f1d8bb5f235b6eaaab15b46a3068

https://www.illumos.org/issues/6939
Originally created https://smartos.org/bugview/OS-4489
sysevents should be fired in the kernel from ZFS whenever a command
is run that is logged in zpool history.
Example output
Terminal 1
root - gz sunos ~ # zfs create zones/foobar
root - gz sunos ~ # zfs set quota=10g zones/foobar
root - gz sunos ~ # zfs destroy zones/foobar
Terminal 2
root - gz sunos ~ # sysevent EC_zfs
nvlist version: 0
date = 2016-04-28T14:50:08.964Z
vendor = SUNW
publisher = zfs
class = EC_zfs
subclass = ESC_ZFS_history_event
pid = 0
data = (embedded nvlist)
nvlist version: 0
pool_name = zones
pool_guid = 0x40c964e8f9a7a694
history_record = (embedded nvlist)
nvlist version: 0
dsname = zones/foobar
dsid = 0x1525
history internal str =
internal_name = create
history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>

331396 23-Mar-2018 mav

MFC r329683: MFV r319736: 6396 remove SVM

illumos/illumos-gate@5f10ef697f250374b7b917e10961c4e02d4e3112
https://github.com/illumos/illumos-gate/commit/5f10ef697f250374b7b917e10961c4e02d4e3112

https://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>

331395 22-Mar-2018 mav

MFC r329681: MFV r318941: 7446 zpool create should support efi system partition

illumos/illumos-gate@7855d95b30fd903e3918bad5a29b777e765db821
https://github.com/illumos/illumos-gate/commit/7855d95b30fd903e3918bad5a29b777e765db821

https://www.illumos.org/issues/7446
Since we support whole-disk configuration for boot pool, we also will need
whole disk support with UEFI boot and for this, zpool create should create efi-
system partition.
I have borrowed the idea from oracle solaris, and introducing zpool create -
B switch to provide an way to specify that boot partition should be created.
However, there is still an question, how big should the system partition be.
For time being, I have set default size 256MB (thats minimum size for FAT32
with 4k blocks). To support custom size, the set on creation "bootsize"
property is created and so the custom size can be set as: zpool create B -
o bootsize=34MB rpool c0t0d0
After pool is created, the "bootsize" property is read only. When -B switch is
not used, the bootsize defaults to 0 and is shown in zpool get output with
value ''. Older zfs/zpool implementations are ignoring this property.
https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.

331385 22-Mar-2018 mav

MFC r329628: MFC r316910: 7812 Remove gender specific language

illumos/illumos-gate@48bbca816818409505a6e214d0911fda44e622e3
https://github.com/illumos/illumos-gate/commit/48bbca816818409505a6e214d0911fda44e622e3

https://www.illumos.org/issues/7812
This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>

331384 22-Mar-2018 mav

MFC r329625: MFV r307315:
7301 zpool export -f should be able to interrupt file freeing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175

331383 22-Mar-2018 mav

MFC r329623: MFV r302649: 7016 arc_available_memory is not 32-bit safe

illumos/illumos-gate@0dd053d7d890618ea1fc697b07de364e69eb4190
https://github.com/illumos/illumos-gate/commit/0dd053d7d890618ea1fc697b07de364e69eb4190

https://www.illumos.org/issues/7016
upstream DLPX-39446 arc_available_memory is not 32-bit safe
https://github.com/delphix/delphix-os/commit/
6b353ea3b8a1610be22e71e657d051743c64190b
related to this upstream:
DLPX-38547 delphix engine hang
https://github.com/delphix/delphix-os/commit/
3183a567b3e8c62a74a65885ca60c86f3d693783
DLPX-38547 delphix engine hang (fix static global)
https://github.com/delphix/delphix-os/commit/
22ac551d8ef085ad66cc8f65e51ac372b12993b9
DLPX-38882 system hung waiting on free segment
https://github.com/delphix/delphix-os/commit/
cdd6beef7548cd3b12f0fc0328eeb3af540079c2

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Prakash Surya <prakash.surya@delphix.com>

331302 21-Mar-2018 avg

MFC r330592: MFV r330591: 8984 fix for 6764 breaks ACL inheritance

PR: 216886

331017 15-Mar-2018 kevans

MFC r317055,r317056 (glebius): Include sys/vmmeter.h as included

r317055: All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.

r317056: Typo!


/freebsd-11-stable/sys/amd64/amd64/efirt.c
/freebsd-11-stable/sys/amd64/amd64/minidump_machdep.c
/freebsd-11-stable/sys/amd64/amd64/uma_machdep.c
/freebsd-11-stable/sys/arm/arm/intr.c
/freebsd-11-stable/sys/arm/arm/machdep.c
/freebsd-11-stable/sys/arm/arm/trap-v4.c
/freebsd-11-stable/sys/arm/arm/trap-v6.c
/freebsd-11-stable/sys/arm/arm/undefined.c
/freebsd-11-stable/sys/arm64/arm64/minidump_machdep.c
/freebsd-11-stable/sys/arm64/arm64/uma_machdep.c
/freebsd-11-stable/sys/cddl/compat/opensolaris/kern/opensolaris_kmem.c
/freebsd-11-stable/sys/cddl/compat/opensolaris/sys/kmem.h
common/fs/zfs/zfs_vnops.c
/freebsd-11-stable/sys/dev/drm/drmP.h
/freebsd-11-stable/sys/dev/drm2/drmP.h
/freebsd-11-stable/sys/fs/msdosfs/msdosfs_denode.c
/freebsd-11-stable/sys/fs/msdosfs/msdosfs_vnops.c
/freebsd-11-stable/sys/kern/kern_mib.c
/freebsd-11-stable/sys/kern/kern_thread.c
/freebsd-11-stable/sys/kern/subr_intr.c
/freebsd-11-stable/sys/kern/subr_syscall.c
/freebsd-11-stable/sys/mips/include/intr_machdep.h
/freebsd-11-stable/sys/mips/mips/minidump_machdep.c
/freebsd-11-stable/sys/mips/mips/uma_machdep.c
/freebsd-11-stable/sys/ofed/drivers/infiniband/core/umem.c
/freebsd-11-stable/sys/powerpc/powerpc/uma_machdep.c
/freebsd-11-stable/sys/sparc64/sparc64/intr_machdep.c
/freebsd-11-stable/sys/sparc64/sparc64/machdep.c
/freebsd-11-stable/sys/sparc64/sparc64/mem.c
/freebsd-11-stable/sys/ufs/ffs/ffs_balloc.c
/freebsd-11-stable/sys/ufs/ffs/ffs_vfsops.c
/freebsd-11-stable/sys/vm/device_pager.c
/freebsd-11-stable/sys/vm/memguard.c
/freebsd-11-stable/sys/vm/sg_pager.c
/freebsd-11-stable/sys/vm/vm_reserv.c
/freebsd-11-stable/sys/x86/x86/intr_machdep.c
/freebsd-11-stable/sys/x86/xen/xenpv.c
330991 15-Mar-2018 avg

MFC r329363: read-behind / read-ahead support for zfs_getpages()

330990 15-Mar-2018 avg

MFC r329823: another rework of getzfsvfs / getzfsvfs_impl code

330988 15-Mar-2018 avg

MFC r330057: add ZFS_ENTER protection to .zfs/snapshot vnode operations that need it

330986 15-Mar-2018 avg

MFC r329717: MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue

330824 13-Mar-2018 mav

MFC r330048: Add sysctls/tunables for dbuf cache size.

330734 10-Mar-2018 asomers

MFC r329412:

zfs: fix formatting in a log statement

Submitted by: Dave Baukus <daveb@spectralogic.com>
Sponsored by: Spectra Logic Corp

330732 10-Mar-2018 asomers

MFC r329265, r329384

r329265:
Implement .vop_pathconf and .vop_getacl for the .zfs ctldir

zfsctl_common_pathconf will report all the same variables that regular ZFS
volumes report. zfsctl_common_getacl will report an ACL equivalent to 555,
except that you can't read xattrs or edit attributes.

Fixes a bug where "ls .zfs" will occasionally print something like:
ls: .zfs/.: Operation not supported

PR: 225793
Reviewed by: avg
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D14365

r329384:
Handle generic pathconf attributes in the .zfs ctldir

MFC instructions: change the value of _PC_LINK_MAX to INT_MAX

Reported by: jhb
X-MFC-With: 329265
Sponsored by: Spectra Logic Corp

330590 07-Mar-2018 avg

MFC r329719: MFV r329718: 8520 7198 lzc_rollback_to should support rolling back to origin

330588 07-Mar-2018 avg

MFC r329714: MFV r329713: 8731 ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks

330523 05-Mar-2018 asomers

MFC r326400

Revert r326399

Accidentally committed wrong file

Pointy hat to: asomers
Sponsored by: Spectra Logic Corp

330237 01-Mar-2018 avg

MFC r329314: MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio

PR: 223803

330235 01-Mar-2018 avg

MFC r329711: MFV r329710: 8966 use after end of the lifetime of a local variable

PR: 225162

330070 27-Feb-2018 avg

MFC r329556,r329820 remove an assert in zfsctl_snapdir_lookup to match r323578

330064 27-Feb-2018 avg

MFC r329016: remove a duplicate assignment

330062 27-Feb-2018 avg

MFC r328881: zfs: move a utility function, ioflags, closer to its consumers

330061 27-Feb-2018 avg

MFC r328776: ZFS ARC: restore illumos uses of 'needfree' that were removed in r325851

330060 27-Feb-2018 avg

MFC r328626: zfs_rezget: drop cached pages before doing anything else

330058 27-Feb-2018 avg

MFC r328217: zfs: no need to check that size of zfs_cmd_t is not greater than IOCPARM_MAX

329847 23-Feb-2018 mav

MFC r329091: Add sysctls for dnode block and indirect block shifts.

329780 22-Feb-2018 asomers

MFC r326399:

Fix assertion when ZFS fails to open certain devices

"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:

1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.

The solution is to change the initialization order to ensure that the
assertion will be satisfied.

PR: 221066
Reported by: David NewHamlet <wheelcomplex@gmail.com>
Reviewed by: avg
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13278

329494 18-Feb-2018 mav

MFC r328254:
MFV r328253: 8835 Speculative prefetch in ZFS not working for misaligned reads

illumos/illumos-gate@5cb8d943bc8513c6230589aad5a409d58b0297cb

https://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance. It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>

329493 18-Feb-2018 mav

MFC r328252: MFV r328251: 8652 Tautological comparisons with ZPROP_INVAL

illumos/illumos-gate@4ae5f5f06c6c2d1db8167480f7d9e3b5378ba2f2

https://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>

329491 18-Feb-2018 mav

MFC r328248: MFV r328247:
8959 Add notifications when a scrub is paused or resumed

illumos/illumos-gate@301fd1d6f25595cd8c6d6795f39c72d97aff8cd9

Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>

329490 18-Feb-2018 mav

MFC r328246:
MFV r328245: 8856 arc_cksum_is_equal() doesn't take into account ABD-logic

illumos/illumos-gate@01a059ee0cdece49f47fd4d70086dd5bc7d0b0ff

https://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>

329487 18-Feb-2018 mav

MFC r328230: MFV r328229:
8930 zfs_zinactive: do not remove the node if the filesystem is readonly

illumos/illumos-gate@93c618e0f4932dc0bb9a9c90d8c4a5d029de5797

https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.

One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).

Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>

329486 18-Feb-2018 mav

MFC r328228: MFV r328227: 8909 8585 can cause a use-after-free kernel panic

illumos/illumos-gate@94ddd0900a8838f62bba15e270649a42f4ef9f81

https://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only

> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race is present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

329485 18-Feb-2018 mav

MFC r328226: MFV r328225:
8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock"

illumos/illumos-gate@cf07d3da9915c0d22da8f59e991639f819463cef

https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".

Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

329484 18-Feb-2018 mav

MFC r328224: MFV r328220: 8677 Open-Context Channel Programs

illumos/illumos-gate@a3b2868063897ff0083dea538f55f9873eec981f

https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.

This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

329481 17-Feb-2018 mav

MFC r321104 (by jhibbits): Make ZFS not crash on mount on 32-bit systems

ZPL_VERSION is unsigned long long, not an int. With this change, a zpool
can be created on a 32-bit system (tested on powerpcspe) and mounted
correctly.

329249 13-Feb-2018 markj

MFC r327888, r327972, r327973:
Add "jid" and "jailname" variables to DTrace.

328298 23-Jan-2018 jhb

MFC 320900,323882,324224,324226,324228,326986,326988,326989,326990,326993,
326994,326995,327004: Various fixes for pathconf(2).

The original change to use vop_stdpathconf() more widely was motivated
by a panic due to recent AIO-related changes. However, bde@ reported
that vop_stdpathconf() contained too many settings that were not
filesystem-independent. The end result of this set of patches is to
fix the AIO-related panic via use of a trimmed-down vop_stdpathconf()
while also adding support for missing pathconf variables in various
filesystems (and removing a few settings incorrectly reported as
supported).

320900:
Consistently use vop_stdpathconf() for default pathconf values.

Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values. vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits. Filesystems can still
explicitly override individual settings.

323882:
Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices.

Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files. In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.

324224:
Handle _PC_FILESIZEBITS and _PC_SYMLINK_MAX pathconf() requests in cd9660.

cd9660 only supports symlinks with Rock Ridge extensions, so
_PC_SYMLINK_MAX is conditional on Rock Ridge.

324226:
Return 64 for pathconf(_PC_FILESIZEBITS) on tmpfs.

324228:
Flesh out pathconf() on UDF.

- Return 64 bits for _PC_FILESIZEBITS.
- Handle _PC_SYMLINK_MAX.
- Defer _PC_PATH_MAX to vop_stdpathconf().

326986:
Add a custom VOP_PATHCONF method for fdescfs.

The method handles NAME_MAX and LINK_MAX explicitly. For all other
pathconf variables, the method passes the request down to the underlying
file descriptor. This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf(). Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.

326988:
Add a custom VOP_PATHCONF method for fuse.

This method handles _PC_FILESIZEBITS, _PC_SYMLINK_MAX, and _PC_NO_TRUNC.
For other values it defers to vop_stdpathconf().

326989:
Support _PC_FILESIZEBITS in msdosfs' VOP_PATHCONF().

326990:
Handle _PC_FILESIZEBITS and _PC_NO_TRUNC for smbfs' VOP_PATHCONF().

326993:
Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().

Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.

326994:
Handle _PC_FILESIZEBITS and _PC_SYMLINK_MAX for devfs' VOP_PATHCONF().

326995:
Use FUSE_LINK_MAX for LINK_MAX in fuse' VOP_PATHCONF().

Should have included this in r326993.

327004:
Rework pathconf handling for FIFOs.

On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

PR: 219851
Sponsored by: Chelsio Communications

328235 22-Jan-2018 mav

MFC r322245: MFV r322242: 8373 TXG_WAIT in ZIL commit path

illumos/illumos-gate@d28671a3b094af696bea87f52272d4c4d89321c7
https://github.com/illumos/illumos-gate/commit/d28671a3b094af696bea87f52272d4c4d89321c7

https://www.illumos.org/issues/8373
The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign
a transaction to a transaction group. That seems to be logically
incorrect as writing of the ZIL block does not introduce any new dirty
data. Also, when there is a lot of dirty data, the call can introduce
significant delays into the ZIL commit path, thus affecting all
synchronous writes. Additionally, ARC throttling may affect the ZIL
writing.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

328049 16-Jan-2018 avg

MFC r327725: zfs_mount: restore a bit of ifdef-out illumos code

327551 04-Jan-2018 markj

MFC r326774, r326811:
Pass the trap frame to fasttrap hooks.

327491 02-Jan-2018 markj

MFC r326919:
Unregister the ARC lowmem event handler earlier in arc_fini().

327470 01-Jan-2018 dim

MFC r327167:

Remove obsolete register keyword from opensolaris's sysmacros.h. When
compiling zfsd with recent clang, it leads to a warning about the
register storage class being incompatible with C++17.

327188 26-Dec-2017 asomers

MFC r326401:

Fix assertion when ZFS fails to open certain devices

"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:

1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.

The solution is to change the initialization order to ensure that the
assertion will be satisfied.

PR: 221066
Reported by: David NewHamlet <wheelcomplex@gmail.com>
Reviewed by: avg
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13278

326980 19-Dec-2017 markj

MFC r326813:
MFV r326785: 8880 improve DTrace error checking

326701 08-Dec-2017 markj

MFC r326134:
Duplicate helpers after disabling inherited tracepoints during a fork.

326619 06-Dec-2017 bapt

MFC r325851:

remove the poor emulation of the IllumOS needfree global variable to prevent
the ARC reclaim thread running longer than needed.

Update the arc::needfree dtrace probe triggered in arc_lowmem() to also report
the value we may want to free.

Submitted by: Nikita Kozlov <nikita.kozlov at blade-group.com>
Reviewed by: avg
Approved by: avg
Sponsored by: blade
Differential Revision: https://reviews.freebsd.org/D12163

326427 01-Dec-2017 avg

MFC r326070: zfs_write: fix problem with writes appearing to succeed when over quota

The problem happens when the writes have offsets and sizes aligned with
a filesystem's recordsize (maximum block size). In this scenario
dmu_tx_assign() would fail because of being over the quota, but the uio
would already be modified in the code path where we copy data from the
uio into a borrowed ARC buffer. That makes an appearance of a partial
write, so zfs_write() would return success and the uio would be modified
consistently with writing a single block.

That bug can result in a data loss because the writes over the quota
would appear to succeed while the actual data is being discarded.

This commit fixes the bug by ensuring that the uio is not changed until
after all error checks are done. To achieve that the code now uses
uiocopy() + uioskip() as in the original illumos design. We can do that
now that uiocopy() has been updated in r326067 to use
vn_io_fault_uiomove().

326319 28-Nov-2017 asomers

MFC r324940:

Fix the error message when creating a zpool on a too-small device

Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path.
It's redundant with the check in vdev_open, and failing to attach here
results in the wrong error message being printed. However, still check for
it in some other situations:

* When opening by guids, so we don't get bogged down reading from slow
devices like floppy drives.
* In vdev_geom_read_pool_label for the same reason, because we iterate over
all providers.
* If the caller requests that we verify the guid, because then we'll have to
read from the device before vdev_open verifies the size.

PR: 222227
Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Reviewed by: avg, mav
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12531

326129 23-Nov-2017 markj

MFC r325887:
Avoid holding the process in uread() and uwrite().

325931 17-Nov-2017 avg

MFC r325610: MFV r325609: 7531 Assign correct flags to prefetched buffers

325912 16-Nov-2017 avg

MFC r325228: vdev_geom_close: close errored consumer even if vdev_reopening is set

325911 16-Nov-2017 avg

MFC r325608: MFV r325607: 8607 zfs: variable set but not used

325541 08-Nov-2017 avg

MFC r324195: MFV r323795: 8604 Avoid unnecessary work search in VFS when unmounting snapshots

325539 08-Nov-2017 avg

MFC r324757: remove spa_sync_on assert from spa_async_thread_vd

325538 08-Nov-2017 avg

MFC r324197: MFV r323913: 8600 ZFS channel programs - snapshot

325537 08-Nov-2017 avg

MFC r324196: MFV r323912: 8592 ZFS channel programs - rollback

325536 08-Nov-2017 avg

MFC r324170: MFV r323794: 8605 zfs channel programs: zfs.exists undocumented and non-working

325535 08-Nov-2017 avg

MFC r324168: MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best()

325534 08-Nov-2017 avg

MFC r324163: MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups

Also MFC-ed are the following fixes:
- r324164 Add several new files to the files enabled by ZFS kernel option
- r324178 unbreak kernel builds on sparc64 and powerpc
- r324194 fix incorrect use of getzfsvfs_impl in r324163
- r324292 really unbreak kernel builds on sparc64 and powerpc64


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs-program.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs.8
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_impl.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_util.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/kernel.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h
/freebsd-11-stable/cddl/lib/libzpool/Makefile
/freebsd-11-stable/cddl/sbin/zfs/Makefile
/freebsd-11-stable/sys/cddl/compat/opensolaris/kern/opensolaris_sunddi.c
/freebsd-11-stable/sys/cddl/compat/opensolaris/sys/sunddi.h
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
common/Makefile.files
common/fs/zfs/dsl_dataset.c
common/fs/zfs/dsl_destroy.c
common/fs/zfs/dsl_dir.c
common/fs/zfs/lua
common/fs/zfs/lua/lstrlib.c
common/fs/zfs/sys/dsl_dataset.h
common/fs/zfs/sys/dsl_destroy.h
common/fs/zfs/sys/dsl_dir.h
common/fs/zfs/sys/zcp.h
common/fs/zfs/sys/zcp_global.h
common/fs/zfs/sys/zcp_iter.h
common/fs/zfs/sys/zcp_prop.h
common/fs/zfs/sys/zfs_ioctl.h
common/fs/zfs/sys/zfs_vfsops.h
common/fs/zfs/zcp.c
common/fs/zfs/zcp_get.c
common/fs/zfs/zcp_global.c
common/fs/zfs/zcp_iter.c
common/fs/zfs/zcp_synctask.c
common/fs/zfs/zfs_ioctl.c
common/fs/zfs/zfs_vfsops.c
common/sys/fs/zfs.h
/freebsd-11-stable/sys/conf/files
/freebsd-11-stable/sys/conf/kern.pre.mk
/freebsd-11-stable/sys/modules/zfs/Makefile
325153 30-Oct-2017 avg

MFC r324349: MFV r322235: 8067 zdb should be able to dump literal embedded block pointer

325132 30-Oct-2017 avg

MFC r324011, r324016: MFV r323535: 8585 improve batching done in zil_commit()

FreeBSD notes:
- this MFV reverts FreeBSD commit r314549 to make the merge easier
- at present our emulation of cv_timedwait_hires is rather poor,
so I elected to use cv_timedwait_sbt directly
Please see the differential revision for details.
Unfortunately, I did not get any positive reviews, so there could be
bugs in the FreeBSD-specific piece of the merge.
Hence, the long MFC timeout.

illumos/illumos-gate@1271e4b10dfaaed576c08a812f466f6e81370e5e
https://github.com/illumos/illumos-gate/commit/1271e4b10dfaaed576c08a812f466f6e81370e5e

https://www.illumos.org/issues/8585
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

324756 19-Oct-2017 ian

MFC r323985:

Use nstosbt() instead of multiplying by SBT_1NS to avoid roundoff errors.

Differential Revision: https://reviews.freebsd.org/D11779

324282 04-Oct-2017 markj

MFC r324066:
Use C99 initializers for DTrace provider methods.

324253 04-Oct-2017 avg

MFC r323483: zfsctl_snapdir_lookup should be able to handle an uncovered vnode

324252 04-Oct-2017 avg

MFC r323481: zfsvfs_hold: assert that the busied filesystem can not be unmounted

324205 02-Oct-2017 avg

MFC r323433,r323793,r323915: MFV r323110: 8558 lwp_create() returns EAGAIN
on system with more than 80K ZFS filesystems, and followups

r323433: MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems
r323793: MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure
r323915: MFV r323914: 8661 remove "zil-cw2" dtrace probe

324203 02-Oct-2017 avg

MFC r323918: MFV r323917: 8648 Fix range locking in ZIL commit codepath

This fixes a problem introduced in r315441, MFC of r308782.

324161 01-Oct-2017 avg

MFV r323796: fix memory leak in g_bio zone introduced in r320452

I overlooked the fact that that ZIO_IOCTL_PIPELINE does not include
ZIO_STAGE_VDEV_IO_DONE stage. We do allocate a struct bio for an ioctl
zio (a disk cache flush), but we never freed it.

This change splits bio handling into two groups, one for normal
read/write i/o that passes data around and, thus, needs the abd data
tranform; the other group is for "data-less" i/o such as trim and cache
flush.

PR: 222288

324160 01-Oct-2017 avg

MFC r323797: add vfs_zfs.abd_chunk_size tunable

It is reported that the default value of 4KB results in a substantial
memory use overhead (at least, on some configurations). Using 1KB seems
to reduce the overhead significantly.

PR: 222377

324158 01-Oct-2017 avg

MFC r323522: slightly simplify zfs_vptocnp

324063 27-Sep-2017 asomers

MFC r323813:

MFV r323789: 8473 scrub does not detect errors on active spares

illumos/illumos-gate@554675eee75dd2d7398d960aa5c81083ceb8505a
https://github.com/illumos/illumos-gate/commit/554675eee75dd2d7398d960aa5c81083ceb8505a

https://www.illumos.org/issues/8473
Scrubbing is supposed to detect and repair all errors in the pool. However,
it wrongly ignores active spare devices. The problem can easily be
reproduced in OpenZFS at git rev 0ef125d with these commands:

truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c,
# which should've already been fixed

FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more.

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Sponsored by: Spectra Logic Corp

324010 26-Sep-2017 avg

MFC r323355: MFV r323107: 8414 Implemented zpool scrub pause/resume

illumos/illumos-gate@1702cce751c5cb7ead878d0205a6c90b027e3de8
https://github.com/illumos/illumos-gate/commit/1702cce751c5cb7ead878d0205a6c90b027e3de8

FreeBSD note: rather than merging the zpool.8 update I copied the zpool
scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost
verbatim. Now that the illumos page uses the mdoc format, it was an
easier option. Perhaps the change is not in perfect compliance with the
FreeBSD style, but I think that it is acceptible.

https://www.illumos.org/issues/8414
This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
Currently, there is no way to pause a scrub. Pausing may be useful when
the pool is busy with other I/O to preserve bandwidth.

Description

This patch adds the ability to pause and resume scrubbing. This is achieved
by maintaining a persistent on-disk scrub state. While the state is 'paused'
we do not scrub any more blocks. We do however perform regular scan
housekeeping such as freeing async destroyed and deadlist blocks while paused.

Motivation and Context

Scrub pausing can be an I/O intensive operation and people have been asking
for the ability to pause a scrub for a while. This allows one to preserve scrub
progress while freeing up bandwidth for other I/O.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>

324008 26-Sep-2017 avg

MFC r323480: zfs_get_vfs: reference a requested filesystem instead of vfs_busy-ing it

Sponsored by: Panzura

324004 26-Sep-2017 avg

MFC r323479,r323491: zfs: tighten debug versions of ZTOV and VTOZ

Sponsored by: Panzura

323761 19-Sep-2017 avg

MFC r323482: zfs_ctldir: remove obsolete / bogus ARGSUSED lint directives

None of the tagged functions had unused parameters.

323760 19-Sep-2017 avg

MFC r323435: MFV r323111: 8569 problem with inline functions in abd.h

illumos/illumos-gate@37e84ab74e939caf52150fc3352081786ecc0c29
https://github.com/illumos/illumos-gate/commit/37e84ab74e939caf52150fc3352081786ecc0c29

https://www.illumos.org/issues/8569
C [C99] has peculiar rules for inline functions that are different from the
C++ rules. Unlike C++ where inline is "fire and forget", in C a programmer
must pay attention to the function's storage class / visibility. The main
problem is with the case where a compiler decides to not inline a call to the
function declared as inline.
Some relevant links:
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html
- http://www.drdobbs.com/the-new-c-inline-functions/184401540
The summary is that either the inline functions should be declared 'static
inline' or one of the compilation units (.c files) must provide a callable
externally visible function definition. In the former case, the compiler would
automatically create a local non-inlined function instance in every compilation
unit where it's needed. In the latter case the single external definition is
used to satisfy any non-inlined calls in all compilation units. As things
stand right now, we can get an undefined reference error under certain
combinations of compilers and compiler options. For example, this is what I
get on FreeBSD when compiling with clang 4.0.0 and -O1:
In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385:
undefined reference to `abd_is_linear'

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323758 19-Sep-2017 avg

MFC r322241: MFV r322240: 8491 uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS

illumos/illumos-gate@79c2b812ee2010ebf20fdd92dc5f06b59000a94c
https://github.com/illumos/illumos-gate/commit/79c2b812ee2010ebf20fdd92dc5f06b59000a94c

https://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

323757 19-Sep-2017 avg

MFC r322230: MFV r322229: 7600 zfs rollback should pass target snapshot to kernel

illumos/illumos-gate@77b171372ed21642e04c873ef1e87fe2365520df
https://github.com/illumos/illumos-gate/commit/77b171372ed21642e04c873ef1e87fe2365520df

https://www.illumos.org/issues/7600
At present, the kernel side code seems to blindly rollback to whatever happens
to be the latest snapshot at the time when the rollback task is processed.
The expected target's name should be passed to the kernel driver and the sync
task should validate that the target exists and that it is the latest snapshot
indeed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323755 19-Sep-2017 avg

MFC r322228: MFV r322227: 8377 Panic in bookmark deletion

illumos/illumos-gate@42418f9e73f0d007aa87675ecc206c26fc8e073e
https://github.com/illumos/illumos-gate/commit/42418f9e73f0d007aa87675ecc206c26fc8e073e

https://www.illumos.org/issues/8377
The problem is that when dsl_bookmark_destroy_check() is executed from open
context (the pre-check), it fills in dbda_success based on the existence of the
bookmark.
But the bookmark (or containing filesystem as in this case) can be destroyed
before we get to syncing context. When we re-run dsl_bookmark_destroy_check()
in syncing
context, it will not add the deleted bookmark to dbda_success, intending for
dsl_bookmark_destroy_sync() to not process it. But because the bookmark is
still in dbda_success
from the open-context call, we do try to destroy it.
The fix is that dsl_bookmark_destroy_check() should not modify dbda_success
when called from open context.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

323754 19-Sep-2017 avg

MFC r322222: MFV r322221: 7910 l2arc_write_buffers() may write beyond target_sz

FreeBD note: the essence of this change was committed to FreeBSD in
r314274. This commit catches up with differences between what was
committed to FreeBSD and what was committed to OpenZFS, mainly more
logical variable names.

illumos/illumos-gate@16a7e5ac116c85d965007a5f201104b564e82210
https://github.com/illumos/illumos-gate/commit/16a7e5ac116c85d965007a5f201104b564e82210

https://www.illumos.org/issues/7910
It seems that the change in issue #6950 resurrected the problem that was
earlier fixed by the change in issue #5219.
Please also see the following FreeBSD bug report:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216178

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323753 19-Sep-2017 avg

MFC r322234: zfs: no need for __DECONST after abd constification in r322233

Note that vdev_label_write_pad2() is FreeBSD specific.

323752 19-Sep-2017 avg

MFC r322239: MFV r322238: 7915 checks in l2arc_evict could use some cleaning up

illumos/illumos-gate@267ae6c3a88d2fc39276af66caafa978b0935b82
https://github.com/illumos/illumos-gate/commit/267ae6c3a88d2fc39276af66caafa978b0935b82

https://www.illumos.org/issues/7915
l2arc_evict() is strictly serialized with respect to
l2arc_write_buffers() and l2arc_write_done(). Normally, l2arc_evict()
and l2arc_write_buffers() are called from the same thread, so they can
not be concurrent. Also, l2arc_write_buffers() uses zio_wait() on the
parent zio of all cache zio-s. That ensures that l2arc_write_done()
is completed before l2arc_write_buffers() returns. Finally, if a
cache device is removed, then l2arc_evict() is called under SCL_ALL in
the exclusive mode. That ensures that it can not be concurrent with
the normal L2ARC accesses to the device (including writing and
evicting buffers). Given the above, some checks and actions in
l2arc_evict() do not make sense. For instance, it must never
encounter the write head header let alone remove it from the buffer
list.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323750 19-Sep-2017 avg

MFC r322237: MFV r322236: 8126 ztest assertion failed in dbuf_dirty due to dn_nlevels changing

illumos/illumos-gate@dcb6872c565819ac88acbc2ece999ef241c8b982
https://github.com/illumos/illumos-gate/commit/dcb6872c565819ac88acbc2ece999ef241c8b982

https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

323749 19-Sep-2017 avg

MFC r322233: MFV r322232: 8426 mark immutable buffer arguments as such in abd.h

illumos/illumos-gate@9b195260e22529ac0e2580faaf89402420589c1c
https://github.com/illumos/illumos-gate/commit/9b195260e22529ac0e2580faaf89402420589c1c

https://www.illumos.org/issues/8426
abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
qualify them with const.
abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
corresponding arguments.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

323748 19-Sep-2017 avg

MFC r322226: MFV r322223: 8378 crash due to bp in-memory modification of nopwrite block

illumos/illumos-gate@b7edcb940884114e61382937505433c4c38c0278
https://github.com/illumos/illumos-gate/commit/b7edcb940884114e61382937505433c4c38c0278

https://www.illumos.org/issues/8378
The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which
we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr
to zgd_bp, dbuf_write_ready()
could change db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale
BP and that the dbuf it not dirty, so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the
db_mtx. We could still see a stale
db_blkptr, but if it is stale then the dirty record will still exist and thus
we won't attempt to nopwrite.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

323746 19-Sep-2017 avg

MFC r321471: spa_import_rootpool should be able to handle an imported root pool

That is required to support reboot -r with a new root filesystem being
on an already imported pool.

PR: 210721

323667 17-Sep-2017 bapt

MFC r323051:

Add sysctls for arc shrinking and growing values

The default value for arc_no_grow_shift may not be optimal when using
several GiB ARC. Expose it via sysctl allows users to tune it easily.

Also expose arc_grow_retry via sysctl for the same reason. The default value of
60s might, in case of intensive load, be too long.

Submitted by: Nikita Kozlov <nikita.kozlov@blade-group.com>
Reviewed by: mav, manu, bapt
Sponsored by: blade
Differential Revision: https://reviews.freebsd.org/D12144

323285 07-Sep-2017 asomers

MFC r322546:

Fix some ZFS debugging messages

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Be more careful about the use of provider names vs vdev names in
ZFS_LOG statements.

Sponsored by: Spectra Logic Corp

321614 27-Jul-2017 mav

MFC r320352: zfs: port vdev_file part of illumos change 3306

3306 zdb should be able to issue reads in parallel
illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1
https://www.illumos.org/issues/3306

The upstream change was made before we started to import upstream commits
individually. It was imported into the illumos vendor area as r242733.
That commit was MFV-ed in r260138, but as the commit message says
vdev_file.c was left intact.

This commit actually implements the parallel I/O for vdev_file using a
taskqueue with multiple thread. This implementation does not depend on
the illumos or FreeBSD bio interface at all, but uses zio_t to pass
around all the relevent data. So, the code looks a bit different from
the upstream.

This commit also incorporates ZoL commit
zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed
https://github.com/zfsonlinux/zfs/issues/2270
We need to use a dedicated taskqueue for exactly the same reason as ZoL
as we do not implement TASKQ_DYNAMIC.

Obtained from: illumos, ZFS on Linux

321613 27-Jul-2017 mav

MFC r320239: MFV r319950:
5220 L2ARC does not support devices that do not provide 512B access

FreeBSD note: the actual change has been in FreeBSD since r297848. This
commit accounts for integration of that change with subsequent changes,
especially r320156 (MFV of r318946) and r314274.

illumos/illumos-gate@403a8da73c64ff9dfb6230ba045c765a242213fb
https://github.com/illumos/illumos-gate/commit/403a8da73c64ff9dfb6230ba045c765a242213fb

https://www.illumos.org/issues/5220
There are disk devices that have logical sector size larger than 512B, for
example 4KB. That is, their physical sector size is larger than 512B and they
do not provide emulation for 512B sector sizes. For such devices both a data
offset and a data size must be properly aligned. L2ARC should arrange that
because it uses physical I/O.
zio_vdev_io_start() performs a necessary transformation if io_size is not
aligned to vdev_ashift, but that is done only for logical I/O. Something
similar should be done in L2ARC code.
* a temporary write buffer should be allocated if the original buffer is
not going to be compressed and its size is not aligned
* size of a temporary compression buffer should be ashift aligned
* for the reads, if a size of a target buffer is not sufficiently large and
it is not aligned then a temporary read buffer should be allocated

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

321612 27-Jul-2017 mav

MFC r320238: MFV r319742: 8056 zfs send size estimate is inaccurate for some zvols

illumos/illumos-gate@0255edcc85fc0cd1dda0e49bcd52eb66c06a1b16
https://github.com/illumos/illumos-gate/commit/0255edcc85fc0cd1dda0e49bcd52eb66c06a1b16

https://www.illumos.org/issues/8056
The send size estimate for a zvol can be too low, if the size of the record
headers (dmu_replay_record_t's) is a significant portion of the size.
This is typically the case when the data is highly compressible, especially
with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
blocks are the size of the "recordsize" property (128KB).
However, for zvols, the blocks are the size of the "volblocksize" property
(8KB). Therefore, we estimate that there will be 16x less record headers than
there really will be.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

321611 27-Jul-2017 mav

MFC r320237: MFV r318947: 7578 Fix/improve some aspects of ZIL writing.

FreeBSD note: this commit removes small differences between what mav
committed to FreeBSD in r308782 and what ended up committed to illumos
after addressing all review comments.

illumos/illumos-gate@c5ee46810f82e8a53d2cc5a487568a573f449039
https://github.com/illumos/illumos-gate/commit/c5ee46810f82e8a53d2cc5a487568a573f449039

https://www.illumos.org/issues/7578
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>

321610 27-Jul-2017 mav

MFC r320156, r320185, r320186, r320262, r320452, r321111:
MFV r318946: 8021 ARC buf data scatter-ization

illumos/illumos-gate@770499e185d15678ccb0be57ebc626ad18d93383
https://github.com/illumos/illumos-gate/commit/770499e185d15678ccb0be57ebc626ad1
8d93383

https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linea
r
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)

FreeBSD notes:
- the actual (default) chunk size is 4KB (despite the text above saying 1KB)
- we can try to reimplement ABDs, so that they are not permanently
mapped into the KVA unless explicitly requested, especially on
platforms with scarce KVA
- we can try to use unmapped I/O and avoid intermediate allocation of a
linear, virtual memory mapped buffer
- we can try to avoid extra data copying by referring to chunks / pages
in the original ABD

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>

321609 27-Jul-2017 mav

MFC r320153: revert r315852 which introduced zio_buf_alloc_nowait for use
in vdev_queue_aggregate

I think that the change is still good, but reconciling it with a planned
merge of the ARC buf data scatter-ization is a bit more tedious
than I can handle.

321591 26-Jul-2017 emaste

MFC r321218: zfs: Fix a typo in the delay_min_dirty_percent sysctl description

The description is FreeBSD-specific and was added in r266497
to fix PR189865.

PR: 220825
Submitted by: Fabian Keil
Obtained from: ElectroBSD

321579 26-Jul-2017 mav

MFC r319953: MFV r319951: 8311 ZFS_READONLY is a little too strict

illumos/illumos-gate@2889ec41c05e9ffe1890b529b3111354da325aeb
https://github.com/illumos/illumos-gate/commit/2889ec41c05e9ffe1890b529b3111354d
a325aeb

https://www.illumos.org/issues/8311
Description:
There was a misunderstanding about the enforcement details of the "Read-only"
flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC
2007/315 case.
The original authors thought enforcement of the READONLY flag should work
similarly as the IMMUTABLE flag. Unfortunately, that enforcement is
incompatible with the expectations of Windows applications using this feature
through the SMB service. Applications assume (and the MS File System Algorithm
s
MS-FSA confirms they should) that an SMB client can:
(a) Open an SMB handle on a file with read/write access,
(b) Set the DOS attributes to include the READONLY flag,
(c) continue to have write access via that handle.
This access model is essentially the same as a Unix/POSIX application that
creates a file (with read/write access), uses fchmod() to change the file mode
to something not granting write access (i.e. 0444), and then continues to writ
e
that file using the open handle it got before the mode change.
Currently, the SMB server works-around this problem in a way that will become
difficult to maintain as we implement support for SMB3 persistent handles, so
SMB depends on this fix.
I've written a test program that can be used to demonstrate this problem, and
added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos).
It currently fails, but will pass when this problem fixed.
Steps to Reproduce:
Run the test program on a ZFS file system.
Expected Results:
Pass
Actual Results:
Fail.

Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Prakash Surya <prakash.surya@delphix.com>
Author: Gordon Ross <gwr@nexenta.com>

321578 26-Jul-2017 mav

MFC r319949: MFV r319948: 5428 provide fts(), reallocarray(), and strtonum()

illumos/illumos-gate@4585130b259133a26efae68275dbe56b08366deb
https://github.com/illumos/illumos-gate/commit/4585130b259133a26efae68275dbe56b08366deb

https://www.illumos.org/issues/5428

Most of the upstream change is not applicable to FreeBSD.
Only the renaming of strtonum to zfs_strtonum is relevant to us.
And we already had it partially done.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>

321577 26-Jul-2017 mav

MFC r319947: MFV r319945,r319946: 8264 want support for promoting datasets in libzfs_core

illumos/illumos-gate@a4b8c9aa65a0a735aba318024a424a90d7b06c37
https://github.com/illumos/illumos-gate/commit/a4b8c9aa65a0a735aba318024a424a90d7b06c37

https://www.illumos.org/issues/8264
Oddly there is a lzc_clone function, but no lzc_promote function.

Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Andrew Stormont <astormont@racktopsystems.com>

321575 26-Jul-2017 mav

MFC r319750: MFV r319741: 8156 dbuf_evict_notify() does not need dbuf_evict_lock

illumos/illumos-gate@dbfd9f930004c390a2ce2cf850c71b4f880eef9c
https://github.com/illumos/illumos-gate/commit/dbfd9f930004c390a2ce2cf850c71b4f880eef9c

https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321574 26-Jul-2017 mav

MFC r319749: MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configuration
s

illumos/illumos-gate@5b062782532a1d5961c4a4b655906e1238c7c908
https://github.com/illumos/illumos-gate/commit/5b062782532a1d5961c4a4b655906e123
8c7c908

https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321573 26-Jul-2017 mav

MFC r319748: MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers

illumos/illumos-gate@adaec86ad212d9fd756bee322934fa54d1258605
https://github.com/illumos/illumos-gate/commit/adaec86ad212d9fd756bee322934fa54d1258605

https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321570 26-Jul-2017 mav

MFC r318945: MFV r318944: 8265 Reserve send stream flag for large dnode feature

illumos/illumos-gate@bc83969fdbd1cb0d97ba00218c0a3de5c89fba92
https://github.com/illumos/illumos-gate/commit/bc83969fdbd1cb0d97ba00218c0a3de5c
89fba92

https://www.illumos.org/issues/8265
Reserve bit 23 in the zfs send stream flags for the large
dnode feature which has been implemented for Linux.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brian Behlendorf <behlendorf1@llnl.gov>

321568 26-Jul-2017 mav

MFC r318935: MFV r318934: 8070 Add some ZFS comments

illumos/illumos-gate@40713f2b249d289022c715107b3951055a63aef0
https://github.com/illumos/illumos-gate/commit/40713f2b249d289022c715107b3951055a63aef0

https://www.illumos.org/issues/8070
Add some ZFS comments left by various developers at different times

Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>

321567 26-Jul-2017 mav

MFC r318932: MFV r318931: 8063 verify that we do not attempt to access inactive txg

illumos/illumos-gate@b7b2590dd9f11b12a0b4878db3886068cce176af
https://github.com/illumos/illumos-gate/commit/b7b2590dd9f11b12a0b4878db3886068cce176af

https://www.illumos.org/issues/8063
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321566 26-Jul-2017 mav

MFC r318930: MFV r318929: 7786 zfs`vdev_online() needs better notification about state changes

illumos/illumos-gate@5f368aef86387d6ef4eda84030ae9b402313ee4c
https://github.com/illumos/illumos-gate/commit/5f368aef86387d6ef4eda84030ae9b402313ee4c

https://www.illumos.org/issues/7786
Currently, vdev_online() will only post sysevent if previous state was
"offline". It should also post the event when the state changes from "removed"
or "faulted" to "healthy" or "degraded".
This will fix the following scenario:
- pull disk from slot A
- check that hotspare has taken its place (if available)
- insert disk into slot B
- check that hotspare moved back to "avail" state (if spare was used)
The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail
to update the vdev FRU.

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Approved by: Albert Lee <trisk@forkgnu.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>

321565 26-Jul-2017 mav

MFC r318928: MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf

illumos/illumos-gate@def4fac5882b4ca67bd0f4a53509b6d1fa8ae14e
https://github.com/illumos/illumos-gate/commit/def4fac5882b4ca67bd0f4a53509b6d1fa8ae14e

https://www.illumos.org/issues/8025
dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321563 26-Jul-2017 mav

MFC r318925: MFV r316929: 6914 kernel virtual memory fragmentation leads to hang

illumos/illumos-gate@af868f46a5b794687741d5424de9e3a2d684a84a
https://github.com/illumos/illumos-gate/commit/af868f46a5b794687741d5424de9e3a2d684a84a

https://www.illumos.org/issues/6914

FreeBSD note: only a ZFS part of the change is merged, changes to the VM
subsystem are not ported (obviously). Also, now that FreeBSD has
vmem(9) we don't have to ifdef-out the code that uses it.

321562 26-Jul-2017 mav

MFC r318924 (by avg):
arc_init: make code closer to upstream by introducing 'allmem' variable

All the differences in calculations are kept.
A comment about arc_max being 1/2 of all memory is fixed to reflect the
actual code that uses 5/8 as a factor.

321561 26-Jul-2017 mav

MFC r318923: zfs_putpages: assert that sa_bulk_update() must succeed

Same as the upstream does in r316927.

321559 26-Jul-2017 mav

MFC r318921: MFV r316928: 7256 low probability race in zfs_get_data

illumos/illumos-gate@0c94e1af6784c69a1dea25e0e35dd13b2b91e2e5
https://github.com/illumos/illumos-gate/commit/0c94e1af6784c69a1dea25e0e35dd13b2b91e2e5

https://www.illumos.org/issues/7256
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= zp->z_blksz);
It's possible, although extremely rare, that the zfs_get_done() callback is
executed before dmu_sync() returns.
In that case the znode's range lock is dropped and the znode is unreferenced.
Thus, the assertion can access some invalid or wrong data via the zp pointer.
size variable caches the correct value of z_blksz and can be safely used here.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

321558 26-Jul-2017 mav

MFC r318920: MFV r316924: 8061 sa_find_idx_tab can be declared more type-safely

illumos/illumos-gate@7f0bdb4257bb4f1f76390b72665961e411da24c6
https://github.com/illumos/illumos-gate/commit/7f0bdb4257bb4f1f76390b72665961e411da24c6

https://www.illumos.org/issues/8061
sa_find_idx_tab() is declared as taking and returning "void *" parameters.
These can be declared to be the specific types.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321556 26-Jul-2017 mav

MFC r318833: MFV r316925: 6101 attempt to lzc_create() a filesystem under a volume results in a panic

illumos/illumos-gate@b127fe3c059af7adf772735498680b4f2e1405ef
https://github.com/illumos/illumos-gate/commit/b127fe3c059af7adf772735498680b4f2e1405ef

https://www.illumos.org/issues/6101
lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to
create a filesystem as a child of a volume, instead it proceeds to a crash.
A crash stack obtained on FreeBSD:
page fault while in kernel mode

zap_leaf_lookup()
fzap_lookup()
zap_lookup_norm()
zap_lookup()
zfs_get_zplprop()
zfs_fill_zplprops_impl()
zfs_ioc_create()
zfsdev_ioctl()
devfs_ioctl_f()
kern_ioctl()
sys_ioctl()
This crash happened with a kernel without debugging assertions.
The immediate cause of crash appears to an attempt to interpret a zvol object
as a zap object.
For filesystems:
#define MASTER_NODE_OBJ 1
For zvols:
#define ZVOL_OBJ 1ULL
#define ZVOL_ZAP_OBJ 2ULL
So, I see two problems here:
1. an attempt to create a filesystem under a zvol should be rejected as
early as possible, maybe in zfs_fill_zplprops()
2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
of the zap object types

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>

321554 26-Jul-2017 mav

MFC r318829: MFV r316920: 8023 Panic destroying a metaslab deferred range tree

illumos/illumos-gate@3991b535a8e990c0369be677746a87c259b13e9f
https://github.com/illumos/illumos-gate/commit/3991b535a8e990c0369be677746a87c259b13e9f

https://www.illumos.org/issues/8023
$C
ffffff0011bc0970 vpanic()
ffffff0011bc0a00 strlog()
ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00)
ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380)
ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800)
ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000)
ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0)
ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000)
ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000)
ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68)
ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003,
ffffff03e1956b10, ffffff0011bc0e68, 0)
ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190)
ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>

321553 26-Jul-2017 mav

MFC r318828: MFV r316917: 7968 multi-threaded spa_sync()

illumos/illumos-gate@94c2d0eb22e9624151ee84a7edbf7178e1bf4087
https://github.com/illumos/illumos-gate/commit/94c2d0eb22e9624151ee84a7edbf7178e1bf4087

https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321552 26-Jul-2017 mav

MFC r318827: MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists

illumos/illumos-gate@10fbdecb05f411234920f8d3c92c148d39106d7e
https://github.com/illumos/illumos-gate/commit/10fbdecb05f411234920f8d3c92c148d39106d7e

https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321550 26-Jul-2017 mav

MFC r318824: MFV r316915: 7801 add more by-dnode routines (lint)

illumos/illumos-gate@411be58a6e030a3b606f1afcc7f2e2459ffda844
https://github.com/illumos/illumos-gate/commit/411be58a6e030a3b606f1afcc7f2e2459ffda844

321549 26-Jul-2017 mav

MFC r318823: MFC r316914: 7801 add more by-dnode routines

illumos/illumos-gate@b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe
https://github.com/illumos/illumos-gate/commit/b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe

https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from: https://github.com/zfsonlinux/zfs/commit/0eef1bde31d67091d3deed23fe2394f5a8bf2276

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>

321548 26-Jul-2017 mav

MFC r318822: MFC r316913: 7869 panic in bpobj_space(): null pointer dereference

illumos/illumos-gate@a3905a45920de250d181b66ac0b6b71bd200d9ef
https://github.com/illumos/illumos-gate/commit/a3905a45920de250d181b66ac0b6b71bd200d9ef

https://www.illumos.org/issues/7869
The issue fixed by this patch is a race condition in the deadlist code.
A thread executing an administrative command that uses
`dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
protect the access of all its entries that the deadlist contains in an
avl tree.
Sync threads trying to insert a new entry in the deadlist
(through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
sentinel value), we close and reopen it. Between these two operations,
it is possible for the `dsl_deadlist_space_range()` thread to dereference
that bpobj which is `NULL` during that window.
Threads should hold the a deadlist's `dl_lock` when they manipulate its
internal data so scenarios like the one above are avoided. In addition,
threads should also hold the bpobj lock whenever they are allocating the
subobj list of a bpobj, and not just when they actually insert the subobj
to the list. This way we can avoid potential memory leaks.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

321547 26-Jul-2017 mav

MFC r318821: MFV r316912: 7793 ztest fails assertion in dmu_tx_willuse_space

illumos/illumos-gate@61e255ce7267b52208af9daf434b77d37fb75622
https://github.com/illumos/illumos-gate/commit/61e255ce7267b52208af9daf434b77d37
fb75622

https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321545 26-Jul-2017 mav

MFC r318818: MFV r316907: 1300 filename normalization doesn't work for removes

illumos/illumos-gate@1c17160ac558f98048951327f4e9248d8f46acc0
https://github.com/illumos/illumos-gate/commit/1c17160ac558f98048951327f4e9248d8f46acc0

https://www.illumos.org/issues/1300

FreeBSD note: recent FreeBSD was not affected by the issue fixed as the
name cache is completely bypassed when normalization is enabled.
The change is imported for the sake of ZAP infrastructure modifications.

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

321542 26-Jul-2017 mav

MFC r317648: Fix misport of compressed ZFS send/recv from 317414

Reported by: Michael Jung <mikej@mikej.com>

321541 26-Jul-2017 mav

MFC r317541: MFV 316905

7740 fix for 6513 only works in hole punching case, not truncation

illumos/illumos-gate@7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553
https://github.com/illumos/illumos-gate/commit/7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553

https://www.illumos.org/issues/7740
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
To verify, create a large file, truncate it to a short length, and then
write beyond the end. Check with zdb to make sure that there are no
holes with birth time zero (will appear as gaps).

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>

321540 26-Jul-2017 mav

MFC r317533: MFV 316900

7743 per-vdev-zaps have no initialize path on upgrade

illumos/illumos-gate@555da5111b0f2552c42d057b211aba89c9c79f6c
https://github.com/illumos/illumos-gate/commit/555da5111b0f2552c42d057b211aba89c9c79f6c

https://www.illumos.org/issues/7743
When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

321539 26-Jul-2017 mav

MFC r317527: MFV 316898

7613 ms_freetree[4] is only used in syncing context

illumos/illumos-gate@5f145778012b555e084eacc858ead9e1e42bd149
https://github.com/illumos/illumos-gate/commit/5f145778012b555e084eacc858ead9e1e42bd149

https://www.illumos.org/issues/7613
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
replace it with two trees: the freeing tree (ranges that we are freeing this
syncing txg) and the freed tree (ranges which have been freed this txg).

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321538 26-Jul-2017 mav

MFC r317522: MFV 316897

7586 remove #ifdef __lint hack from dmu.h

illumos/illumos-gate@4ba5b9616327ef64e8abc737d29b3faabc6ae68c
https://github.com/illumos/illumos-gate/commit/4ba5b9616327ef64e8abc737d29b3faabc6ae68c

https://www.illumos.org/issues/7586
The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
we add other inline functions into header files in ZFS, especially since it is
difficult to make any other solution work across all compilation targets. We
should switch to disabling the lint flags that are failing instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>

321537 26-Jul-2017 mav

MFC r317511: MFV 316896

7580 ztest failure in dbuf_read_impl

illumos/illumos-gate@1a01181fdc809f40c64d5c6881ae3e4521a9d9c7
https://github.com/illumos/illumos-gate/commit/1a01181fdc809f40c64d5c6881ae3e4521a9d9c7

https://www.illumos.org/issues/7580
We need to prevent any reader whenever we're about the zero out all the
blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
dbuf_write_children_ready and free_children just prior to calling bzero.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>

321536 26-Jul-2017 mav

MFC r317507: MFV 316895

7606 dmu_objset_find_dp() takes a long time while importing pool

illumos/illumos-gate@7588687e6ba67c47bf7c9805086dec4a97fcac7b
https://github.com/illumos/illumos-gate/commit/7588687e6ba67c47bf7c9805086dec4a97fcac7b

https://www.illumos.org/issues/7606
When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().
There are several ways to improve performance here:
1. We don't really need to do spa_check_logs() or
spa_ld_claim_log_blocks() if the pool was closed cleanly.
2. spa_load_verify() uses dmu_objset_find_dp() to check that no
datasets have too long of names.
3. dmu_objset_find_dp() is slow because it's doing
zap_value_search() (which is O(N sibling datasets)) to determine
the name of each dsl_dir when it's opened. In this case we
actually know the name when we are opening it, so we can provide
it and avoid the lookup.
This change implements fix #3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

321535 26-Jul-2017 mav

MFC r317414: MFV 316894

7252 7628 compressed zfs send / receive

illumos/illumos-gate@5602294fda888d923d57a78bafdaf48ae6223dea
https://github.com/illumos/illumos-gate/commit/5602294fda888d923d57a78bafdaf48ae6223dea

https://www.illumos.org/issues/7252
This feature includes code to allow a system with compressed ARC enabled to
send data in its compressed form straight out of the ARC, and receive data in
its compressed form directly into the ARC.

https://www.illumos.org/issues/7628
We should have longer, more readable versions of the ZFS send / recv options.

7628 create long versions of ZFS send / receive options

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>

321533 26-Jul-2017 mav

MFC r317238: MFV 316871

7490 real checksum errors are silenced when zinject is on

illumos/illumos-gate@6cedfc397d92d64e442f0aae4445ac507beaf58f
https://github.com/illumos/illumos-gate/commit/6cedfc397d92d64e442f0aae4445ac507beaf58f

https://www.illumos.org/issues/7490
When zinject is on, error codes from zfs_checksum_error() can be overwritten
due to an incorrect and overly-complex if condition.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

321532 26-Jul-2017 mav

MFC r317237: MFV 316870

7448 ZFS doesn't notice when disk vdevs have no write cache

illumos/illumos-gate@295438ba3230419314faaa889a2616f561658bd5
https://github.com/illumos/illumos-gate/commit/295438ba3230419314faaa889a2616f56
1658bd5

https://www.illumos.org/issues/7448
I built a SmartOS image with all the NVMe commits including 7372
(support NVMe volatile write cache) and repeated my dd testing:
> #!/bin/bash
> for i in `seq 1 1000`; do
> dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
> dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
> wait
> rm file00 file01
> done
>
Previously each dd command took ~145 seconds to finish, now it takes
~400 seconds.
Eventually I figured out it is 7372 that causes unnecessary
nvme_bd_sync() executions which wasted CPU cycles.
If a NVMe device doesn't support a write cache, the nvme_bd_sync function will
return ENOTSUP to indicate this to upper layers.
It seems this returned value is ignored by ZFS, and as such this bug is not
really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
nvme driver in general does support cache flush. Instead it will issue an
asynchronous flush to nvme and immediately return 0, and hence ZFS will not se
t
vdev_nowritecache here. The nvme driver will at some point process the cache
flush command, and if there is no write cache on the device it will return
ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
function will not check the error code and not set nowritecache.
The right place to check the error code from the cache flush is in
zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
cache flushes. This would also be independent of the implementation detail that
some drivers can return ENOTSUP immediately.

Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>

321531 26-Jul-2017 mav

MFC r317235: MFV 316868

7430 Backfill metadnode more intelligently

illumos/illumos-gate@af346df58864e8fe897b1ff1a3a4c12f9294391b
https://github.com/illumos/illumos-gate/commit/af346df58864e8fe897b1ff1a3a4c12f9
294391b

https://www.illumos.org/issues/7430
Description and patch from brought over from the following ZoL commit: https:/
/
github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don?t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can?t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.

Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>

321530 26-Jul-2017 mav

MFC r316037: MFV: 315989

7603 xuio_stat_wbuf_* should be declared (void)

illumos/illumos-gate@99aa8b55058e512798eafbd71f72f916bdc10181
https://github.com/illumos/illumos-gate/commit/99aa8b55058e512798eafbd71f72f916bdc10181

https://www.illumos.org/issues/7603

The funcs are declared k&r style, where the args are not specified:

void xuio_stat_wbuf_copied();
They should be declared to take no arguments:

void xuio_stat_wbuf_copied(void);
Need to change both .c and .h.

Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

321529 26-Jul-2017 mav

MFC r315896: MFV r315290, r315291: 7303 dynamic metaslab selection

illumos/illumos-gate@8363e80ae72609660f6090766ca8c2c18aa53f0c
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18

https://www.illumos.org/issues/7303

This change introduces a new weighting algorithm to improve metaslab selection
.
The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a res
ult,
the metaslab weight now encodes the type of weighting algorithm used
(size-based vs segment-based).

This also introduce a new allocation tracing facility and two new dcmds to hel
p
debug allocation problems. Each zio now contains a zio_alloc_list_t structure
that is populated as the zio goes through the allocations stage. Here's an
example of how to use the tracing facility:

> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 ENOSPC ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 ENOSPC ztest.0a
1 0 400 1 x 8M 17b1a00 ztest.0a

> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 200 0 NOT_ALLOCATABLE mirror-2
- 0 200 0 NOT_ALLOCATABLE mirror-0
1 0 200 1 x 4M 112ae00 mirror-1
- 1 200 0 NOT_ALLOCATABLE mirror-2
- 1 200 0 NOT_ALLOCATABLE mirror-0
1 1 200 1 x 4M 112b000 mirror-1
- 2 200 0 NOT_ALLOCATABLE mirror-2

If the metaslab is using segment-based weighting then the WEIGHT column will
display the number of segments available in the bucket where the allocation
attempt was made.

Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

321528 26-Jul-2017 mav

MFC r314280: MFV 314276

7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL

illumos/illumos-gate@1c9272b861cd640a8342f4407da026ed98615517
https://github.com/illumos/illumos-gate/commit/1c9272b861cd640a8342f4407da026ed98615517

https://www.illumos.org/issues/7570

Based on the discovery that every unmap waits for the commit of the txn to the ZIL,
introducing a very high latency to unmap commands, this behavior was made into a
tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is
that by default SCSI unmap commands will result in space being freed within the zvol
(today they are ignored and returned with good status). However, unlike the code
today, instead of 18+ms per unmap, they take about 30us.

With the testing done on NTFS against a Win2k12 target, the new behavior should work
seamlessly. Files on the zvol that have already been set with the zfree application
will continue to write 0's when deleted, and any new files created since zvol
creation will send unmap commands when deleted. This behavior exists today, but with
this change the unmap commands will be processed and result in reclaim of space.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

321527 26-Jul-2017 mav

MFC r314267: MFV 314243

6676 Race between unique_insert() and unique_remove() causes ZFS fsid change

illumos/illumos-gate@40510e8eba18690b9a9843b26393725eeb0f1dac
https://github.com/illumos/illumos-gate/commit/40510e8eba18690b9a9843b26393725ee
b0f1dac

https://www.illumos.org/issues/6676

The fsid of zfs filesystems might change after reboot or remount. The problem
seems to
be caused by a race between unique_insert() and unique_remove(). The unique_re
move()
is called from dsl_dataset_evict() which is now an asynchronous thread. In a c
ase the
dsl_dataset_evict() thread is very slow and calls unique_remove() too late we
will end
up with changed fsid on zfs mount.

This problem is very likely caused by #5056.

Steps to Reproduce
Note: I'm able to reproduce this always on a single core (virtual) machine. On
multicore
machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

Impact
The persistent fsid (filesystem id) is essential for proper NFS functionality.
If the fsid of a filesystem changes on remount (or after reboot) the NFS
clients might not be able to automatically recover from such event and the
manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

321524 26-Jul-2017 mav

MFC r313813: MFV 313786

7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid

illumos/illumos-gate@653af1b809998570c7e89fe7a0d3f90992bf0216
https://github.com/illumos/illumos-gate/commit/653af1b809998570c7e89fe7a0d3f90992bf0216

https://www.illumos.org/issues/7500
With the integration of:

commit 0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date: Sat Jul 26 13:40:04 2014 -0800
4873 zvol unmap calls can take a very long time for larger datasets

the dnode's dn_bufs field was changed from a list to a tree. As a result,
the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>

321523 26-Jul-2017 mav

MFC r312535: MFV 312436

6569 large file delete can starve out write ops

illumos/illumos-gate@ff5177ee8bf9a355131ce2cc61ae2da6a5a6fdd6
https://github.com/illumos/illumos-gate/commit/ff5177ee8bf9a355131ce2cc61ae2da
6a5a6fdd6

https://www.illumos.org/issues/6569
The core issue I've found is that there is no throttle for how many
deletes get assigned to one TXG. As a results when deleting large files
we end up filling consecutive TXGs with deletes/frees, then write
throttling other (more important) ops.

There is an easy test case for this problem. Try deleting several
large files (at least 1/2 TB) while you do write ops on the same
pool. What we've seen is performance of these write ops (let's
call it sideload I/O) would drop to zero.

More specifically the problem is that dmu_free_long_range_impl()
can/will fill up all of the dirty data in the pool "instantly",
before many of the sideload ops can get in. So sideload
performance will be impacted until all the files are freed.

The solution we have tested at Nexenta (with positive results)
creates a relatively simple throttle for how many "free" ops we let
into one TXG.

However this solution exposes other problems that should also be
addressed. If we are to slow down freeing of data that means one
has to wait even longer (assuming vnode ref count of 1) to get shell
back after an rm or for NFS thread to finish the free-ing op.
To avoid this the proposed solution is to call zfs_inactive() async
for "large" files. Async freeing then begs for the reclaimed space
to be accounted for in the zpool's "freeing" prop.

The other issue with having a longer delete is the inability to
export/unmount for a longer period of time. The proposed solution
is to interrupt freeing of blocks when a fs is unmounted.

Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

321521 26-Jul-2017 mav

MFC r305701 (by allanjude): MFV r268120:
4936 lz4 could theoretically overflow a pointer with a certain input

illumos/illumos-gate@58d0718061c87e3d647c891ec5281b93c08dba4e

319624 06-Jun-2017 gjb

MFC r318943 (avg):

MFV r318942: 8166 zpool scrub thinks it repaired offline device

https://www.illumos.org/issues/8166
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
See also https://github.com/zfsonlinux/zfs/issues/5806

PR: 219537
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation

319422 01-Jun-2017 avg

MFC r318832: MFV r316923: 8026 retire zfs_throttle_delay and zfs_throttle_resolution

319420 01-Jun-2017 avg

MFC r318830: MFV r316921: 8027 tighten up dsl_pool_dirty_delta

319415 01-Jun-2017 avg

MFC r319096: zfs_lookup: fix bogus arguments to lookup of "snapshot" directory

319091 29-May-2017 avg

MFC r308826: zfs: fix up after the removal of PG_CACHED pages in r308691

Now that r308691 has been MFC-ed as a part of r318716,
r308826 must be MFC-ed as well.

PR: 214629
Reported by: mshirk@daemon-security.com [head], lev [stable/11]

318785 24-May-2017 avg

MFC r316854: rename vfs.zfs.debug_flags to vfs.zfs.debugflags

Since this is a stable branch vfs.zfs.debug_flags sysctl is also kept.
The corresponing tunable could never work.

318716 23-May-2017 markj

MFC r308474, r308691, r309203, r309365, r309703, r309898, r310720,
r308489, r308706:
Add PQ_LAUNDRY and remove PG_CACHED pages.

318648 22-May-2017 asomers

MFC r318189:

vdev_geom may associate multiple vdevs per g_consumer

vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.

Fix this by storing a linked list of vdevs in g_consumer's private field.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

* g_consumer.private now stores a linked list of vdev pointers associated
with the consumer instead of just a single vdev pointer.

* Change vdev_geom_set_physpath's signature to more closely match
vdev_geom_set_rotation_rate

* Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
that we've already accessed the consumer by the time we get here.

* Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
in vdev_geom_open, after we know that the open has succeeded.

PR: 218634
Reviewed by: gibbs
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10391

317833 05-May-2017 asomers

MFC r316760:

Fix vdev_geom_attach_by_guids for partitioned disks

When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.

The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot. Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.

Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.

Reviewed by: mav
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10365

317470 26-Apr-2017 smh

MFC r315449:

Reduce ARC fragmentation threshold

Sponsored by: Multiplay

317469 26-Apr-2017 smh

MFC r316460:

Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children

Sponsored by: Multiplay

316849 14-Apr-2017 avg

MFC r315852: zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate

316847 14-Apr-2017 avg

MFC r315853: zfs_putpages: use TXG_WAIT

316391 02-Apr-2017 asomers

MFC r313483:

Fix setting birthtime in ZFS

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
set the bits that zfs_settattr expects

* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
expected by zfs_xvattr_set. The two levels of indirection seem
excessive, but it minimizes diffs vs OpenZFS.

* In zfs_setattr, check for overflow of va_birthtime (from delphij)

* Remove red herring in zfs_getattr

sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
* Un-booby-trap some macros

New tests are under review at https://github.com/pjd/pjdfstest/pull/6

Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9353

316210 30-Mar-2017 gnn

MFC: 313176, 313177, 313359

Replace the implementation of DTrace's RAND subroutine for generating
low-quality random numbers with a modern implementation (xoroshiro128+)
that is capable of generating better quality randomness without compromising performance.

Submitted by: Graeme Jenkinson

315849 23-Mar-2017 avg

MFC r315076: zfs: provide a special vptocnp method for the .zfs vnode

315842 23-Mar-2017 avg

MFC r314048,r314194: reimplement zfsctl (.zfs) support

315834 23-Mar-2017 avg

MFC r314913: MFV r314911: 7867 ARC space accounting leak

315832 23-Mar-2017 avg

MFC r314912: MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones

315446 17-Mar-2017 mav

MFC r309856: Postpone ZVOL media/block size caching till first open.

At least on FreeBSD there are no legal way to access media or get its
size without opening device/provider first. Postponing this caching
allows to skip several disk seeks per ZVOL/snapshot during import.

For HDD pool with 1 ZVOL in dev mode with 1000 snapshots this reduces
pool import time from 40 to 10 seconds.

315445 17-Mar-2017 mav

MFC r309833: Add missed vfs.zfs.zfetch.max_idistance sysctl.

315444 17-Mar-2017 mav

MFC r308099:
Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz.

315441 17-Mar-2017 mav

MFC r308782:
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by: iXsystems, Inc.

315440 17-Mar-2017 mav

MFC r307397: Add vfs.zfs.zil_log_limit sysctl.

It is at least partially broken now, but that is another question.

315387 16-Mar-2017 mav

MFC r314549: Execute last ZIO of log commit synchronously.

For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction. For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

315384 16-Mar-2017 mav

MFC r314548: Completely skip cache flushing for not supporting log devices.

315072 11-Mar-2017 avg

MFC r314274: l2arc: fix write size calculation broken by Compressed ARC commit

314899 08-Mar-2017 ae

MFC r314497:
Do not invoke the resize event when previous provider's size was zero.
This is similar to r303637 fix for geom_disk.

314873 07-Mar-2017 jpaetzel

MFC 313879

MVF: 313876

7504 kmem_reap hangs spa_sync and administrative tasks

illumos/illumos-gate@405a5a0f5c3ab36cb76559467d1a62ba648bd809
https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80

https://www.illumos.org/issues/7504

We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

314858 07-Mar-2017 avg

MFC r314058: zfs: lower priority of zio_write_issue threads by four

Obtained from: Panzura
Sponsored by: Panzura

314710 05-Mar-2017 mm

MFC r314572:

Fix null pointer dereference in zfs_freebsd_setacl().

Prevents unprivileged users from panicking the kernel by calling
__acl_delete_*() on files or directories inside a ZFS mount.

314665 04-Mar-2017 avg

MFC r314273: zfs: call spa_deadman on a taskqueue thread

314355 27-Feb-2017 avg

MFC r314059: zfs: move zio_taskq_basedc under SYSDC

314031 21-Feb-2017 avg

MFC r313687: remove l2_padding_needed statistic from zfs arc

314030 21-Feb-2017 avg

MFC r313686: check remaining space in zfs implementations of vptocnp

PR: 216939

313120 03-Feb-2017 markj

MFC r312893:
Fix an off-by-one in an assertion on fasttrap tracepoint sizes.

311149 03-Jan-2017 markj

MFC r310647:
Remove an obsolete pragma from dtrace.h.

310959 31-Dec-2016 mjg

MFC r305378,r305379,r305386,r305684,r306224,r306608,r306803,r307650,r307685,
r308407,r308665,r308667,r309067:

cache: put all negative entry management code into dedicated functions

==
cache: manage negative entry list with a dedicated lock

Since negative entries are managed with a LRU list, a hit requires a
modificaton.

Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.

Provide a dedicated lock for use when the cache is only shared-locked.

==

cache: defer freeing entries until after the global lock is dropped

This also defers vdrop for held vnodes.

==

cache: improve scalability by introducing bucket locks

An array of bucket locks is added.

All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.

This is an intermediate step towards removal of the global lock.

==

cache: get rid of the global lock

Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.

Lookups still require the relevant bucket to be locked.

==

cache: ignore purgevfs requests for filesystems with few vnodes

purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.

In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.

Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.

The default limit is the number of bucket locks.

== (by kib)

Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

==

cache: split negative entry LRU into multiple lists

This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.

Create N dedicated lists for new negative entries.

Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.

Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.

==

cache: fix up a corner case in r307650

If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.

Fix the problem by initializing to NULL on entry.

== (by kib)

vn_fullpath1() checked VV_ROOT and then unreferenced
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.

==

cache: fix a race between entry removal and demotion

The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.

==

cache: plug a write-only variable in cache_negative_zap_one

==

cache: ensure that the number of bucket locks does not exceed hash size

The size can be changed by side effect of modifying kern.maxvnodes.

Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.

Force the number of buckets to be not smaller.

Note this should not matter for real world cases.

310795 30-Dec-2016 gnn

MFC: 310175

Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro

Sponsored by: DARPA, AFRL

310515 24-Dec-2016 avg

MFC r309250: MFV r309249: 3821 Race in rollback, zil close, and zil flush

310513 24-Dec-2016 avg

MFC r309099: MFV r308990: 7181 race between zfs_mount and zfs_ioc_rollback

310511 24-Dec-2016 avg

MFC r309098: MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try
to free already free blocks

310509 24-Dec-2016 avg

MFC r309097: MFV r308987: 7180 potential race between
zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename

310335 20-Dec-2016 gnn

MFC: 309669

Fix a kernel panic in DTrace's rw_iswriter subroutine.
On FreeBSD the sense of rw_write_held() and rw_iswriter() were reversed,
probably due to a cut and paste error. Using rw_iswriter() would cause
the kernel to panic.

Reviewed by: markj
Sponsored by: DARPA, AFRL

310328 20-Dec-2016 gnn

MFC: 309069

Add tunable to disable destructive dtrace

Submitted by: Joerg Pernfuss <code.jpe@gmail.com>
Reviewed by: rstone, markj

310105 15-Dec-2016 mav

MFC 309714: Fix spa_alloc_tree sorting by offset in r305331.

Original commit "7090 zfs should improve allocation order" declares alloc
queue sorted by time and offset. But in practice io_offset is always zero,
so sorting happened only by time, while order of writes with equal time was
completely random. On Illumos this did not affected much thanks to using
high resolution timestamps. On FreeBSD due to using much faster but low
resolution timestamps it caused bad data placement on disks, affecting
further read performance.

This change switches zio_timestamp_compare() from comparing uninitialized
io_offset to really populated io_bookmark values. I haven't decided yet
what to do with timestampts, but on simple tests this change gives the
same peformance results by just making code to work as declared.

310066 14-Dec-2016 avg

MFC r308887,309090: fix unsafe modification of zfs_vnodeops when
DIAGNOSTIC is enabled

308914 21-Nov-2016 avg

MFC r308089: zfsbootcfg: a simple tool to set next boot (one time)
options for zfsboot

308595 13-Nov-2016 mav

MFC r308173:
Fix ZIL records ordering when ZVOL opened both with and without FSYNC.

Before this an earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.

308593 12-Nov-2016 mav

MFC r308169:
Pass to zvol_log_truncate() same sync values as to zvol_log_write().

Surplus marking of TX_TRUNCATE records as sync could result in putting them
into ZIL before previous writes if ones were async.

308591 12-Nov-2016 mav

MFC r308055: Add vdev_reopening support to vdev_geom.

It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size. If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.

308589 12-Nov-2016 mav

MFC r308051: Matching GUIDs, handle possible race on vdev detach.

In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.

This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.

Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend. During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.

308587 12-Nov-2016 mav

MFC r308049: Improve few debugging log messages.

308585 12-Nov-2016 mav

MFC r307318: MFV r307314:
6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates

Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Closes #107

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@5fc46359c569369d87728ca09f8705cdff6cc8e2

308447 08-Nov-2016 mav

MFC r307857: Fix panic after ZVOL renamed to name invalid for DEVFS.

308245 03-Nov-2016 avg

MFC r307994: 3746 ZRLs are racy

PR: 204037

308085 29-Oct-2016 mav

MFC r306456: Add #ifdef _KERNEL around send_holes_without_birth_time sysctl.

308084 29-Oct-2016 mav

MFC r306425: MFV r306423:
7402 Create tunable to ignore hole_birth feature

Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature. Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.

Closes #188

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

308082 29-Oct-2016 mav

MFC r306424: MFV r306422:
7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs

dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.

Closes #180

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

307995 27-Oct-2016 avg

MFC r306801: implement zfs_vptocnp() using z_parent property

307671 20-Oct-2016 kib

MFC r307218:
Fix a race in vm_page_busy_sleep(9).

307299 14-Oct-2016 mav

MFC r305563: MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused

The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:

commit ca0cc3918a1789fa839194af2a9245f801a06b1a
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Fri Jul 24 09:53:55 2015 -0700

5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

This patch simply removes this macro from dsl_dataset.h.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307297 14-Oct-2016 mav

MFC r305561: MFV r305560:
7278 tuning zfs_arc_max does not impact arc_c_min

When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@608764beadaf4bb71c5d8fe1818e8392ac66a61b

307295 14-Oct-2016 mav

MFC r305456 (by avg): fix zfs pool creation accidentally broken by r305331

The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.

307293 14-Oct-2016 mav

MFC r305342: Missed FreeBSD-specific piece of r305338.

307290 14-Oct-2016 mav

MFC r305340: MFC r305337:
7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object

Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489a169ab36f9ec1b2a111a60a5563a9f

307288 14-Oct-2016 mav

MFC r305339: MFV r305336: 7247 zfs receive of deduplicated stream fails

This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.

Closes #159

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>

openzfs/openzfs@b09697c8c18be68abfe538de9809938239402ae8

307286 14-Oct-2016 mav

MFC r305338: MFV r305335: 7003 zap_lockdir() should tag hold

zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5a2c13e04328b39ecd2a6d0d3c4f7cb

307284 14-Oct-2016 mav

MFC r305334: MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3b10689fc45f4930d2392f5fe1d9cfa
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
A test failure occurred where a send stream had only a BEGIN record. This
should not be possible if the send returns without error. Prevented this from
happening in the future by adding an assertion to dmu_send_impl() to verify
that if the function returns 0 (success) both a BEGIN and END record are
present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
END records sent), having dump_record() set flags appropriately, adding VERIFY
statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>

307282 14-Oct-2016 mav

MFC r305333: MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr

illumos/illumos-gate@bd56f80007857b960e0981ed0797ad8ec844a96b
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307277 14-Oct-2016 mav

MFC r305331: MFV r304155:
7090 zfs should improve allocation order and throttle allocations

illumos/illumos-gate@0f7643c7376dd69a08acbfc9d1d7d548b10c846a
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
will drive them asynchronously through the allocation stage which can result i
n
blocks being allocated out-of-order. It would be nice to preserve as much of
the logical order as possible.
In addition, the allocations are equally scattered across all top-level VDEVs
but not all top-level VDEVs are created equally. The pipeline should be able t
o
detect devices that are more capable of handling allocations and should
allocate more blocks to those devices. This allows for dynamic allocation
distribution when devices are imbalanced as fuller devices will tend to be
slower than empty devices.
The change includes a new pool-wide allocation queue which would throttle and
order allocations in the ZIO pipeline. The queue would be ordered by issued
time and offset and would provide an initial amount of allocation of work to
each top-level vdev. The allocation logic utilizes a reservation system to
reserve allocations that will be performed by the allocator. Once an allocatio
n
is successfully completed it's scheduled on a given top-level vdev. Each top-
level vdev maintains a maximum number of allocations that it can handle
(mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
groups and round robin across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>

307269 14-Oct-2016 mav

MFC r305325: MFV r303078:
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer

illumos/illumos-gate@926549256b71acd595f69b236779ff6b78fa08ef
https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef

https://www.illumos.org/issues/7086
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
db_blkptr, to prevent it from being changed by syncing context.
Otherwise we may see that ztest got a segfault from this stack:
libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
ztest_dmu_object_alloc_free+0x71(ea294588, 13)
ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
ztest_execute+0x89(a, 807c720, 13, 0)
ztest_thread+0xea(13, 0, 0, 0)
libc.so.1`_thrp_setup+0x88(f0983240)
libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
Looking into it a bit, we see that this is an embedded blockpointer, so
BP_GET_NDVAS should have returned 0:
b32b240::blkptr
EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
Instead, it looks like another thread is modifying this blockpointer:
b32b240::ugrep | ::whatis
f47a0e0c is in [ stack tid=0x19f ]
ebd6ec40 is in [ stack tid=0x226 ]
ea293bd0 is in [ stack tid=0x244 ]
ea293be4 is in [ stack tid=0x244 ]

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307267 14-Oct-2016 mav

MFC r305324: MFV r303077:
7072 zfs fails to expand if lun added when os is in shutdown state

illumos/illumos-gate@c39a2aae1e2c439d156021edfc20910dad7f9891
https://github.com/illumos/illumos-gate/commit/c39a2aae1e2c439d156021edfc20910dad7f9891

https://www.illumos.org/issues/7072
upstream:
38733 zfs fails to expand if lun added when os is in shutdown state
DLPX-36910 spares and caches should not display expandable space
DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>

307265 14-Oct-2016 mav

MFC r305323: MFV r302991: 6950 ARC should cache compressed data

illumos/illumos-gate@dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2
https://github.com/illumos/illumos-gate/commit/dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2

https://www.illumos.org/issues/6950
When reading compressed data from disk, the ARC should keep the compressed
block cached and only decompress it when consumers access the block. The
uncompressed data should be short-lived allowing the ARC to cache a much larger
amount of data. The DMU would also maintain a smaller cache of uncompressed
blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>

307142 12-Oct-2016 avg

MFC r306665: zfs: fix a wrong assertion for extended attributes

PR: 213112

307113 12-Oct-2016 mav

MFC r305224: MFV r304158:
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information

7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often

illumos/illumos-gate@b72b6bb10ad55121a1b352c6f68ebdc8e20c9086
https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e
20c9086

https://www.illumos.org/issues/7136
6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
whenever an aux device gets removed from a pool. However, those sysevents will
be created without the vdev_guid and vdev_path fields. It would be better to
always populate those fields.

https://www.illumos.org/issues/7115
The addition of spa_event_notify in vdev removal code (see #6922) causes event
s
to be generated even if the spare failed to be removed with EBUSY.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>

307112 12-Oct-2016 mav

MFC r305222: MFV r302993: 7104 increase indirect block size

illumos/illumos-gate@4b5c8e93cab28d3c65ba9d407fd8f46e3be1db1c
https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c

https://www.illumos.org/issues/7104
The current default indirect block size is 16KB. We can improve
performance by increasing it to 128KB. This is especially helpful for
any workload that needs to read most of the metadata, e.g.
scrub/resilver, file deletion, filesystem deletion, and zfs send.
We also need to fix a few space estimation errors to make the tests
pass.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307111 12-Oct-2016 mav

MFC r305221: MFV r302992: 7071 lzc_snapshot does not fill in errlist on ENOENT

illumos/illumos-gate@25f7d993adbfb3452ac4625b3791670746d35ae3
https://github.com/illumos/illumos-gate/commit/25f7d993adbfb3452ac4625b379167074
6d35ae3

https://www.illumos.org/issues/7071
upstream
DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307109 12-Oct-2016 mav

MFC r305210: MFV r302661:
7082 bptree_iterate() passes wrong args to zfs_dbgmsg()

illumos/illumos-gate@10e67aa0db0823d5464aafdd681f3c966155c68e
https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966
155c68e

https://www.illumos.org/issues/7082
upstream
DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg()

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307108 12-Oct-2016 mav

MFC r305209: MFV r302660: 6314 buffer overflow in dsl_dataset_name

illumos/illumos-gate@9adfa60d484ce2435f5af77cc99dcd4e692b6660
https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6
92b6660

https://www.illumos.org/issues/6314
Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>


/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zdb/zdb.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs/zfs_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zhack/zhack.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zpool/zpool_main.c
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/ztest/ztest.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_changelist.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_dataset.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_diff.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_impl.h
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_iter.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_mount.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_pool.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs/common/libzfs_sendrecv.c
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs_core/common/libzfs_core.c
/freebsd-11-stable/cddl/usr.sbin/zfsd/tests/zfsd_unittest.cc
/freebsd-11-stable/sys/cddl/contrib/opensolaris/common/zfs/zfs_namecheck.c
common/fs/zfs/dmu_objset.c
common/fs/zfs/dmu_send.c
common/fs/zfs/dsl_bookmark.c
common/fs/zfs/dsl_dataset.c
common/fs/zfs/dsl_deleg.c
common/fs/zfs/dsl_dir.c
common/fs/zfs/dsl_prop.c
common/fs/zfs/dsl_scan.c
common/fs/zfs/dsl_userhold.c
common/fs/zfs/spa.c
common/fs/zfs/spa_history.c
common/fs/zfs/sys/dmu.h
common/fs/zfs/sys/dsl_dataset.h
common/fs/zfs/sys/dsl_dir.h
common/fs/zfs/sys/spa_impl.h
common/fs/zfs/sys/zap.h
common/fs/zfs/sys/zfs_znode.h
common/fs/zfs/zfs_ctldir.c
common/fs/zfs/zfs_ioctl.c
common/fs/zfs/zfs_vfsops.c
common/fs/zfs/zil.c
common/sys/fs/zfs.h
307101 12-Oct-2016 mav

MFC r305197: MFV r302646:
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch

illumos/illumos-gate@ea4a67f462de0a39a9adea8197bcdef849de5371
https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84
9de5371

https://www.illumos.org/issues/6980
doing zfs send -i snap1 snap2 >testfile results in
internal error: Invalid argument
Abort (core dumped)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307049 11-Oct-2016 mav

MFC r305200: MFV r302651:
7054 dmu_tx_hold_t should use refcount_t to track space

illumos/illumos-gate@0c779ad424a92a84d1e07d47cab7f8009189202b
https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009
189202b

https://www.illumos.org/issues/7054
upstream:
ee0003de7d3e598499be7ac3fe6b61efcc47cb7f
DLPX-40399 dmu_tx_hold_t should use refcount_t to track space

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

307048 11-Oct-2016 mav

MFC r305199: MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set

Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.

illumos/illumos-gate@45b1747515a17db45e8971501ee84a26bdff37b2
https://github.com/illumos/illumos-gate/commit/45b1747515a17db45e8971501ee84a26bdff37b2

https://www.illumos.org/issues/7019
Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
skips all processing of secpolicy functions. This means that ZFS is not doing
any kind of verification of the credentials or access rights of the caller and
assuming that (as it is an in-kernel client) all such checks have already been
done.
This turns out to be quite a dangerous assumption, especially with respect to
sdev. In general I don't think it's particularly reasonable to offload this
enforcement of access rights onto other kernel subsystems when ZFS has some
particular local semantics in this area (delegated datasets etc) and does not
provide any kind of API to allow other subsystems to avoid code duplication
when doing it. ZFS should apply its normal access policy to requests from
within the kernel, and callers should take care to give it the correct
credentials and call it from the correct context in order to get the results
they need.
You can observe the currently unfortunate consequences of this bug in any non-
global zone that has access to /dev/zvol or any subset of it via sdev profiles.
In particular, a zone used to contain a KVM or similar which has a single zvol
passed through to it using a <device match= block in its zone XML.
Even though sdev makes something of an attempt to control for whether the
caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
or really at all in the lookup call path. So, if we have a zone that's been
given access to any part of /dev/zvol, it can simply look up the full path to
any other zvol on the entire system, and the node will appear and be able to be
used.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>

307047 11-Oct-2016 mav

MFC r305198: MFV r302647:
6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device

illumos/illumos-gate@63364b0ee2604783e7a55f8425888867768eafa4
https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677
68eafa4

https://www.illumos.org/issues/6922
ZFS does not do a config_sync after removing an aux (spare, log, or cache)
device. AFAICT this isn't being done because it is slow and was deemed
unnecessary. However, it should be such a rare operation that speed doesn't
matter, and not doing it results in two problems:
1) It is theoretically possible to remove an aux device from one pool and
attach it to another, then lose power. When power is restored, both pools woul
d
think that they own the aux device.
2) Removal of the aux device doesn't send any useful sysevents to userland.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>

307046 11-Oct-2016 mav

MFC r305195: MFV r302643:
6902 speed up listing of snapshots if requesting name only and sorting by name

This was our change from the beginning, so just reduce the upstream diff.

307045 11-Oct-2016 mav

MFC r305193: MFV r302642:
6876 Stack corruption after importing a pool with a too-long name

illumos/illumos-gate@c971037baa5d64dfecf6d87ed602fc3116ebec41
https://github.com/illumos/illumos-gate/commit/c971037baa5d64dfecf6d87ed602fc3116ebec41

https://www.illumos.org/issues/6876
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for
trouble. We should check every dataset on import, using a 1024 byte buffer and
checking each time to see if the dataset's new name is longer than 256 bytes.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>

306818 07-Oct-2016 avg

MFC r306292: fix vnode lock assertion for extended attributes directory

306572 02-Oct-2016 markj

MFC r306304:
Move implementations of uread() and uwrite() to the illumos compat layer.

305799 14-Sep-2016 mav

MFC r305123: Fix kernel panic when inheriting properties without default.

There are two writable hidden properties "iscsioptions" and "stmf_sbd_lu",
that have no default string value. Attempt to unset them or replicate
caused kernel panic. This simple bandaid seems fixes the problem nicely.

305271 02-Sep-2016 ngie

MFC r303576:

Conditionalize code which defines sysctls per _KERNEL #ifdef guard

This resolves several issues when compiling libzpool (userspace library), i.e.
-Wimplicit-function-declaration and -Wmissing-declarations issues.

Tested with: clang 3.8.1, gcc 4.2.1, gcc 5.3.0

304138 15-Aug-2016 avg

MFC r302838: 6513 partially filled holes lose birth time

304137 15-Aug-2016 avg

MFC r302837: 6844 dnode_next_offset can detect fictional holes

304127 15-Aug-2016 avg

MFC r302840: 6878 Add scrub completion info to "zpool history"

304122 15-Aug-2016 avg

MFC r302839: 6940 Cannot unlink directories when over quota

303970 11-Aug-2016 avg

MFC r303763,303791,303869: zfs: honour and make use of vfs vnode locking protocol

ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS. Most importantly many things that FreeBSD VFS
manages on behalf of all filesystems are implemented in ZPL in a different
way.
Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS
functionality or, in the worst cases, badly interacts / interferes
with VFS.

The most prominent problem is a deadlock caused by the lock order reversal
of vnode locks that may happen with concurrent zfs_rename() and lookup().
The deadlock is a result of zfs_rename() not observing the vnode locking
contract expected by VFS.

This commit removes all ZPL internal locking that protects parent-child
relationships of filesystem nodes. These relationships are protected
by vnode locks and the code is changed to take advantage of that fact
and to properly interact with VFS.

Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
use TXG_WAIT mode.

Another victim, disputable perhaps, is ZFS support for filesystems with
mixed case sensitivity. That support is not provided by the OS anyway,
so in ZFS it was a buch of dead code.

To do:
- replace ZFS_ENTER mechanism with VFS managed / visible mechanism
- replace zfs_zget with zfs_vget[f] as much as possible
- get rid of not really useful now zfs_freebsd_* adapters
- more cleanups of unneeded / unused code
- fix / replace .zfs support

PR: 209158
Approved by: re (gjb)

303969 11-Aug-2016 avg

MFC r302836: 6874 rollback and receive need to reset ZPL state to what's on disk

Approved by: re (gjb)

302994 18-Jul-2016 avg

MFC r302772: re-apply r299908: zfsctl_snapdir_lookup: clear VV_ROOT of
snapshot's root

The change has been undone in r301275 on the assumption that it was no
longer required. But that was incorrect, because in this case (and only
in this case) the snapshot root vnode is looked up before z_parent is
fixed up.

Approved by: re (delphij)

302913 15-Jul-2016 markj

MFC r302507:
Avoid truncating the return value of DTrace predicates.

Approved by: re (gjb)

302408 08-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
302382 07-Jul-2016 smh

Fix ZFS ARC min / max tunable

Due to ARC initial configuration not being done and kmem information
not being available we need to blindly set zfs_arc_max and zfs_arc_min
when configured via the tunable.

This fixes vfs.zfs.arc_(min|max) configuration via loader.conf broken by
r302265.

Approved by: re(gjb)
MFC after: 1 week


302297 30-Jun-2016 mav

Revert r299454 and r299448.

Those changes were found confusing FreeBSD libc ACL code, that doesn't
differentiate ACL for directories and files, and report ACLs for all
directories created after those patches as non-trivial. On the other
side these changes were considered wrong from POSIX and NFSv4 points of
view. Until further investigation done upstream, revert those changes
locally in preparation for FreeBSD 11.0 release.

Approved by: re (hrs)


302265 29-Jun-2016 smh

Allow ZFS ARC min / max to be tuned at runtime

Prior to this change ZFS ARC min / max could only be changed using
boot time tunables, this allows the values to be tuned at runtime
using the sysctls:
* vfs.zfs.arc_max
* vfs.zfs.arc_min

When adjusting ZFS ARC minimum the memory used will only reduce
to the new minimum given memory pressure.

Reviewed by: allanjude
Approved by: re (gjb)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D5907


302123 23-Jun-2016 avg

fix deadlock-prone code in getzfsvfs()

getzfsvfs() called vfs_busy() in the waiting mode while having a hold on
a pool (via a call to dmu_objset_hold). In other words,
dp_config_rwlock was held in the shared mode while a thread could be
sleeping in vfs_busy().
The pool's txg sync thread needs to take dp_config_rwlock in the
exclusive mode for some actions, e.g., for executing sync tasks. If the
sync thread gets blocked, then any thread waiting for its sync task to
get executed is also blocked. Which, in turn, could mean that
vfs_busy() will keep waiting indefinitely.

The solution is to use vfs_ref() in the locked section and to call
vfs_busy() only after dropping other locks.
Note that a reference on a struct mount object does not prevent an
associated zfsvfs_t object from being destroyed. So, we have to be
careful to operate only on the struct mount object until we successfully
vfs_busy it.

Approved by: re (gjb)
MFC after: 2 weeks


302058 21-Jun-2016 asomers

Fix uninitialized variable from r300881

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Initialize needs_update in vdev_geom_set_physpath

PR: 210409
Reported by: kp
Reviewed by: kp
Approved by: re (hrs)
MFC after: 4 weeks
X-MFC-With: 300881
Sponsored by: Spectra Logic Corp


302012 18-Jun-2016 kib

Fix gcc build.

Reported andt tested by: swills
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


301997 17-Jun-2016 kib

Use vnlru_free(9) to implement dnlc_reduce_cache().

This apparently puts ARC back under the limits after the vnode pressure
rework in r291244, in particular due to the kmem exhaustion.

Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


301873 13-Jun-2016 avg

l2arc: reset b_tmp_cdata to NULL in the case of unset b_daddr

The change is in arc_buf_l2_cdata_free().
Without this we can trip the assertion in arc_hdr_realloc()
if INVARIANTS option is enabled.

Approved by: re (kib)
MFC after: 1 week


301870 13-Jun-2016 avg

zfs_vptocnp: check for an invalid znode

... which can arise after the receive or rollback
and failed zfs_rezget().

Approved by: re (kib)
MFC after: 1 week


301275 03-Jun-2016 avg

zfs: set VROOT / VV_ROOT consistently and in a single place

This is a followup to r300131.

A filesystem's root vnode can be reached not only through VSF_ROOT, but
by other means as well. For example, via a dot-dot lookup.
Also, a root vnode can get reclaimed and then re-created. For these
reasons it was insufficient to clear VV_ROOT flag from a root vnode of a
snapshot mounted under .zfs in zfsctl_snapdir_lookup().

So, now we set the flag in zfs_znode_sa_init() only if a vnode
represent a root of a filesystem or a standalone snapshot.
That is, the flag is not set for snapshots mounted under .zfs.

MFC after: 2 weeks


301273 03-Jun-2016 avg

zfs_root: fix a potential root vnode reference leak

It could happen in an unlikely case that we fail to lock the root vnode
with requested flags (which appear to never include LK_NOWAIT).

MFC after: 1 week


301174 01-Jun-2016 asomers

Improve the English in a comment

sys/cddl/contrib/opensolaris/uts/common/sys/acl.h:
Improve the english in a comment. No functional changes

Submitted by: gibbs
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp


301010 31-May-2016 allanjude

Connect the SHA-512t256 and Skein hashing algorithms to ZFS

Support for the new hashing algorithms in ZFS was introduced in r289422
However it was disconnected because FreeBSD lacked implementations of
SHA-512 (truncated to 256 bits), and Skein.

These implementations were introduced in r300921 and r300966 respectively

This commit connects them to ZFS and enabled these new checksum algorithms

This new algorithms are not supported by the boot blocks, so do not use them
on your root dataset if you boot from ZFS.

Relnotes: yes
Sponsored by: ScaleEngine Inc.


300919 29-May-2016 bdrewery

Avoid more literal-suffix errors with C++11


300906 28-May-2016 asomers

zfsd(8), the ZFS fault management daemon

Add zfsd, which deals with hard drive faults in ZFS pools. It manages
hotspares and replements in drive slots that publish physical paths.

cddl/usr.sbin/zfsd
Add zfsd(8) and its unit tests

cddl/usr.sbin/Makefile
Add zfsd to the build

lib/libdevdctl
A C++ library that helps devd clients process events

lib/Makefile
share/mk/bsd.libnames.mk
share/mk/src.libnames.mk
Add libdevdctl to the build. It's a private library, unusable by
out-of-tree software.

etc/defaults/rc.conf
By default, set zfsd_enable to NO

etc/mtree/BSD.include.dist
Add a directory for libdevdctl's include files

etc/mtree/BSD.tests.dist
Add a directory for zfsd's unit tests

etc/mtree/BSD.var.dist
Add /var/db/zfsd/cases, where zfsd stores case files while it's shut
down.

etc/rc.d/Makefile
etc/rc.d/zfsd
Add zfsd's rc script

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
Fix the resource.fs.zfs.statechange message. It had a number of
problems:

It was only being emitted on a transition to the HEALTHY state.
That made it impossible for zfsd to take actions based on drives
getting sicker.

It compared the new state to vdev_prevstate, which is the state that
the vdev had the last time it was opened. That doesn't make sense,
because a vdev can change state multiple times without being
reopened.

vdev_set_state contains logic that will change the device's new
state based on various conditions. However, the statechange event
was being posted _before_ that logic took effect. Now it's being
posted after.

Submitted by: gibbs, asomers, mav, allanjude
Reviewed by: mav, delphij
Relnotes: yes
Sponsored by: Spectra Logic Corp, iX Systems
Differential Revision: https://reviews.freebsd.org/D6564


300884 27-May-2016 ngie

Fix up r300870

The sys/types.h fix I proposed was only tested with zfs(4), not with
libzpool, which is where the build failure actually existed

Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because they're both
unneeded

MFC after: 1 week
X-MFC with: r300865, r300870
In collaboration with: kib
Submitted by: alc
Sponsored by: EMC / Isilon Storage Division


300881 27-May-2016 asomers

Avoid issuing spa config updates for physical path when not necessary

ZFS's configuration needs to be updated whenever the physical path for a
device changes, but not when a new device is introduced. This is because new
devices necessarily cause config updates, but only if they are actually
accepted into the pool.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When
setting the vdev's physical path, only request a config update if
the physical path has changed. Don't request it when opening a
device for the first time, because the config sync will happen
anyway upstack.

sys/geom/geom_dev.c
Split g_dev_set_physpath and g_dev_set_media out of
g_dev_attrchanged

Submitted by: will, asomers
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6428


300870 27-May-2016 ngie

Unbreak the zfs(4) build

vm/vm_pageout.h grew a dependency on the bool typedef in r300865

arc.c didn't include sys/types.h, which included the definition for the typedef

Other items (ofed, drm2) might need to be chased for this commit.

X-MFC with: r300865
MFC after: 1 week
Pointyhat to: alc
Sponsored by: EMC / Isilon Storage Division


300618 24-May-2016 br

Add initial DTrace support for RISC-V.

Sponsored by: DARPA, AFRL
Sponsored by: HEIF5


300145 18-May-2016 avg

add vop_print methods to vnode operatios of various zfsctl node types

This should help with diagnostics of zfsctl problems.

MFC after: 2 weeks


300134 18-May-2016 avg

move zfsctl_freebsd_root_lookup right next to zfsctl_root_lookup

That makes it easier to reason about the code.

MFC after: 5 weeks


300133 18-May-2016 avg

zfsctl_common_fid: remove redundant assignment

"Reinterpret cast" to zfid_short_t and assignment of zf_len
do the job already.

MFC after: 1 week


300132 18-May-2016 avg

zfsctl: tighten an assertion and remove an unused definition

There are only two entries under .zfs and 'shares' has an ID of a
special persistent object in its filesystem.

MFC after: 1 week


300131 18-May-2016 avg

zfs_root: no need to set the root flag here

That was both redundant as zfs_znode_sa_init() already does the job and
insufficient as the root vnode can be reached via other means.

MFC after: 1 weeks


300130 18-May-2016 avg

zfsctl_freebsd_root_lookup: gfs_vop_lookup may return a doomed vnode

gfs code is (almsot) completely agnostic of FreeBSD VFS locking, so it
does not handle doomed but not yet dead vnodes and may return them.
Check for those vnodes here and retry a lookup.
Note that ZFS and gfs have additional protections that ensure that a
parent vnode of the current vnode is never doomed.

The fixed problem is an occasional failure to lookup a 'snapshot' or
'shares' directories under .zfs.

Note that for the above reason all uses of zfsctl_root_lookup() are
better be replaced with VOP_LOOKUP.

MFC after: 5 weeks


300059 17-May-2016 asomers

Speed up vdev_geom_open_by_guids

Speedup is hard to measure because the only time vdev_geom_open_by_guids
gets called on many drives at the same time is during boot. But with
vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations
like "zpool create" speed up by 65%.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

* Read all of a vdev's labels in parallel instead of sequentially.
* In vdev_geom_read_config, don't read the entire label, including
the uberblock. That's a waste of RAM. Just read the vdev config
nvlist. Reduces the IO and RAM involved with tasting from 1MB to
448KB.

Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6153


300024 17-May-2016 avg

zfs_ioc_rename: fix a reversed condition

FreeBSD zfs_ioc_rename() has an option, not present upstream, that
allows to rename snapshots without unmounting them first. I am not sure
what is a rationale for that option, but its actual behavior was the
opposite of the intended behavior. That is, by default the snapshots
were not unmounted.
The option was introduced as part of a large update from upstream in
r248498.

One of the consequences was a havoc under .zfs/snapshot after the rename.
The snapshots got new names but were mounted on top of directories with
old names, so readdir would list the new names, but lookup would still
find the old mounts.

PR: 209093
Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com>
MFC after: 5 days


299951 16-May-2016 avg

do not destroy 'snapdir' when it becomes inactive

That was just wrong. In fact, we can safely keep this static entry when
it's inactive.
Now the destructive action is moved to the reclaim method and the
function is renamed from zfsctl_snapdir_inactive(0 to
zfsctl_snapdir_reclaim().

Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() +
kmem_free().

Lastly, we can just assert that the node does not any children when it
is reclaimed, even on the force unmount. That's because zfs_umount()
does an extra vflush() pass which should destroy all snapshot-mountpoint
vnodes that are the snapdir's children.

MFC after: 5 weeks


299949 16-May-2016 avg

try to recycle "snap" vnodes as soon as possible

Those vnodes should not linger. "Stale" nodes may get out of
synchronization with actual snapshots. For example if we destroy a
snapshot and create a new one with the same name. Or when we rename a
snapshot.

While there fix the argument type for zfsctl_snapshot_reclaim().
Also, its original argument can be passed to gfs_vop_reclaim() directly.

Bug 209093 could be related although I have not specifically verified
that. Referencing just in case.

PR: 209093
MFC after: 5 weeks


299947 16-May-2016 avg

fix locking in zfsctl_root_lookup

Dropping the root vnode's lock after VFS_ROOT() didn't really help the
fact that we acquired the lock while holding its child's, .zfs, lock
while performing the operaiton.
So, directly use zfs_zget() to get the root vnode.

While there simplify the code in zfsctl_freebsd_root_lookup.
We know that .zfs is always exclusively locked.
We know that there is already a reference on *vpp, so no need for an
extra one.
Account for the fact that .. lookup may ask for a different lock type,
not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire
the lock given the lock flags.

MFC after: 5 weeks


299946 16-May-2016 avg

gfs_lookup_dot() does not have to acquire any locks

In fact, that was dangerous. For example, zfsctl_snapshot_reclaim()
calls gfs_dir_lookup() on ".." path and that ends up calling
gfs_lookup_dot() which violated locking order by acquiring the parent's
directory vnode lock after the child's vnode lock.

Also, the previous behavior was inconsistent as gfs_dir_lookup()
returned a locked vnode for . and .. lookups, but not for any other.

Now gfs_lookup_dot() just references a resulting vnode and the locking
is done in its consumers, where necessary.
Note that we do not enable shared locking support for any gfs / zfsctl
vnodes.

This commit partially reverts r273641.

MFC after: 5 weeks


299945 16-May-2016 avg

avoid deadlock between zfsctl_snapdir_lookup and zfsctl_snapshot_reclaim

The former acquired a snap vnode lock while holding sd_lock while the
latter does the opposite.

The solution is drop sd_lock before acquiring the vnode lock. That
should be okay as we are still holding a lock on the 'snapshot'
directory in the exclusive mode. That lock ensures that there are no
concurrent lookups in the directory and thus no concurrent mount attempts.

But now we have to account for the possibility that the snap vnode
might get reclaim after we drop sd_lock and before we can get
the node lock. So, check for that case and retry.

MFC after: 5 weeks


299940 16-May-2016 avg

fix a vnode reference leak caused by illumos compat traverse()

This commit partially reverts r273641 which introduced the leak.
It did so to accomodate for some consumers of traverse() that expected
the starting vnode to stay as-is. But that introduced the leak in the
case when a mounted filesystem was found and its root vnode was
returned.

r299914 removed the troublesome consumers and now there is no reason to
keep the starting vnode. So, now the new rules are:
- if there is no mounted filesystem, then nothing is changed
- otherwise the starting vnode is always released
- the root vnode of the mounted filesystem is returned locked and
referenced in the case of success

MFC after: 5 weeks
X-MFC after: r299914


299938 16-May-2016 avg

fix up r299902: mount_snapshot requires that the covered vnode is locked

Previously that was not strictly enforced.

MFC after: 4 weeks
X-MFC with: r299902


299914 16-May-2016 avg

zfsctl_ops_snapshot: remove methods should never be called

We pretend that snapshots mounted under .zfs are part of the original
filesystem and we try very hard to hide vnodes on top of which the snapshots
are mounted. Given that I believe that the removed operations should
never be called. They might have been called previously because
of issues fixed in r299906, r299908 and r299913.

MFC after: 5 weeks


299908 16-May-2016 avg

zfsctl_snapdir_lookup: always clear VV_ROOT flag of snapshot's root VV_ROOT

Previosuly we did that only if the snapshot was mounted earlier, its
root vnode got recycled and then we accessed it again.
We never cleared the flag for a freshly mounted snapshot.

That was very inconsistent and probably a source of some bugs.
Or maybe that painted over some bugs which might get revealed now.

We should consistently clear the flag because we try very hard to
pretend that snapshots auto-mounted under .zfs are part of their
original filesystem. In other words, we try to hide the fact that they
are different filesystems / mountpoints.

MFC after: 5 weeks


299906 16-May-2016 avg

add zfs_vptocnp with special handling for snapshots under .zfs

The logic is similar to that already present in zfs_dirlook() to handle
a dot-dot lookup on a root vnode of a snapshot mounted under
.zfs/snapshot/.
illumos does not have an equivalent of vop_vptocnp, so there only the
lookup had to be patched up.

MFC after: 4 weeks


299900 16-May-2016 avg

zfsctl: fix several problems with reference counts

* Remove excessive references on a snapshot mountpoint vnode.
zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from
zfsctl_snapshot_mknode() and the latter also had a call to VN_HOLD()
on the same vnode.
On top of that gfs_dir_create() already returns the vnode with the
use count of 1 (set in getnewvnode).
So there was 3 references on the vnode.

* mount_snapshot() should keep a reference to a covered vnode.
That reference is owned by the mountpoint (mounted snapshot filesystem).

* Remove cryptic manipulations of a covered vnode in zfs_umount().
FreeBSD dounmount() already does the right thing and releases the covered
vnode.

PR: 207464
Reported by: dustinwenz@ebureau.com
Tested by: Howard Powell <hpowell@lighthouseinstruments.com>
MFC after: 3 weeks


299454 11-May-2016 mav

MFV r299453: 6765 zfs_zaccess_delete() comments do not accurately reflect
delete permissions for ACLs

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

openzfs/openzfs@a40149b935cbbe87bf95e2cc44b3bc99d400513a


299452 11-May-2016 mav

MFV r299451: 6764 zfs issues with inheritance flags during chmod(2) with
aclmode=passthrough

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Albert Lee <trisk@nexenta.com>

openzfs/openzfs@1bcf0d240bdebed61b4261f7c0ee323e07c8dfac


299450 11-May-2016 mav

MFV r299449: 6763 aclinherit=restricted masks inherited permissions by group
perms (groupmask)

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Albert Lee <trisk@nexenta.com>

openzfs/openzfs@eebb483d0cd68bdc4cf03c01fdeba9af160c17af


299448 11-May-2016 mav

MFV r299442: 6762 POSIX write should imply DELETE_CHILD on directories - and
some additional considerations

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

openzfs/openzfs@d316fffc9c361532a482208561bbb614dac7f916


299441 11-May-2016 mav

MFV r299440: 6736 ZFS per-vdev ZAPs

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Joe Stein <joe.stein@delphix.com>

openzfs/openzfs@215198a6ad15cf4832370e2f19247abeb36b951a


299439 11-May-2016 mav

MFV r299438: 6842 Fix empty xattr dir causing lockup

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chunwei Chen <tuxoko@gmail.com>

openzfs/openzfs@02525cd08fb3730fff3a69cb5376443d481f7839


299437 11-May-2016 mav

MFV r299436: 6843 Make xattr dir truncate and remove in one tx

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chunwei Chen <tuxoko@gmail.com>

openzfs/openzfs@399cc7d5d9aff97c714b708af3e3f0280ceab93f


299435 11-May-2016 mav

MFV r299434: 6841 Undirty freed spill blocks

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Tim Chase <tim@chase2k.com>

openzfs/openzfs@445e67805dd2ca6c3a2363b2ec9e163c62370233


299118 05-May-2016 br

Implement FBT provider (MD part) for DTrace on MIPS.
Tested on MIPS64.

Sponsored by: DARPA, AFRL
Sponsored by: HEIF5


298814 29-Apr-2016 asomers

Fix a use-after-free when "zpool import" fails

clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach.
In the latter function, it would fail to happen in certain circumstances
where cp->private was unset. Ideally, the latter should never happen, but
it can happen when vdev open fails, or where spares are involved.

MFC after: 4 weeks
X-MFC-With: 298786
Sponsored by: Spectra Logic Corp


298786 29-Apr-2016 asomers

Refactor vdev_geom_attach and friends to reduce code duplication

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Move checks for provider's sectorsize and mediasize into a single
location in vdev_geom_attach. Remove the zfs::vdev::taste class;
it's ok to use the regular vdev class for tasting. Consolidate guid
checks into a single location in vdev_attach_ok. Consolidate some
error handling code from vdev_geom_attach into vdev_geom_detach,
closing a resource leak of geom consumers in the process.

Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5974


298590 25-Apr-2016 markj

Increase DTRACE_FUNCNAMELEN from 128 to 192.

This allows for the long function components encountered in www/firefox.
This constant is part of DTrace's userland ABI, so this change may not be
MFC'ed.

PR: 207735


298589 25-Apr-2016 markj

Allow DOF sections with excessively long probe function components.

Without this change, DTrace will refuse to load a DOF section if the
function component of any of its probes exceeds DTRACE_FUNCNAMELEN (128).
Probes in C++ programs can have very long function components. Rather than
rejecting all probes if a single probe exceeds the limit, simply skip the
invalid probe and emit a warning. This ensures that valid probes are
instantiated.

PR: 207735
MFC after: 2 weeks


298472 22-Apr-2016 avg

MFV r298471: 6052 decouple lzc_create() from the implementation details

illumos/illumos-gate@26455f9efcf9b1e44937d4d86d1ce37b006f25a9
https://github.com/illumos/illumos-gate/commit/26455f9efcf9b1e44937d4d86d1ce37b006f25a9

https://www.illumos.org/issues/6052
At the moment type parameter of lzc_create() is of dmu_objset_type_t type.
That exposes an implementation detail and requires sys/fs/zfs.h to be included
in libzfs_core.h creating unnecessary coupling between libzfs_core interface
and ZFS internals.
I think that dmu_objset_type_t should be replaced with a libzfs_core
enumeration of supported dataset types.
For ABI reasons the new enumeration could be bit-compatible with
dmu_objset_type_t.
For example:
typedef enum {
LZC_DST_ZFS = 2,
LZC_DST_ZVOL
} lzc_dataset_type_t;

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after: 2 weeks
Sponsored by: ClusterHQ


298171 17-Apr-2016 markj

Make the second argument of dtrace_invop() a trapframe pointer.

Currently this argument is a pointer into the stack which is used by FBT
to fetch the first five probe arguments. On all non-x86 architectures it's
simply the trapframe address, so this change has no functional impact. On
amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the
first five argument registers, which are deliberately grouped together in
the amd64 trapframe definition.

A trapframe argument simplifies the invop handlers on !x86 and makes the
x86 FBT invop handler easier to understand. Moreover, it allows for invop
handlers that may want to modify the register set of the interrupted thread.


298106 16-Apr-2016 avg

zfs_rezget: z_vnode can not be NULL if zp is valid

MFC after: 3 weeks


298105 16-Apr-2016 avg

zfs: enable vn_io_fault support

Note that now we have to account for possible partial writes
in dmu_write_uio_dbuf(). It seems that on illumos either all or none
of the data are expected to be written. But the partial writes are
quite expected when vn_io_fault support is enabled.

Reviewed by: kib
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D2790


298072 15-Apr-2016 asomers

Don't corrupt ZFS label's physpath attribute when booting while a disk is missing

Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach
prior to verifying the device's GUIDs. vdev_geom_attach calls
vdev_geom_attrchange to set the physpath in the vdev object. The result is
that if the disk could not be found, then the labels for other disks in the
same TLD would overwrite the missing disk's physpath with the physpath of
whichever disk currently has the same devname as the missing one used to
have.

MFC after: 4 weeks
Sponsored by: Spectra Logic Corp


298017 14-Apr-2016 asomers

Add more debugging statements in vdev_geom.c

Log a debugging message whenever geom functions fail in vdev_geom_attach.
Printing these messages is controlled by vfs.zfs.debug

MFC after: 4 weeks
Sponsored by: Spectra Logic Corp


297986 14-Apr-2016 asomers

Update a debugging message in vdev_geom_open_by_guids for consistency with
similar messages elsewhere in the file.

MFC after: 4 weeks
Sponsored by: Spectra Logic Corp


297868 12-Apr-2016 asomers

Fix rare double free in vdev_geom_attrchanged

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Don't drop the g_topology_lock before freeing old_physpath. That
opens up a race where one thread can call vdev_geom_attrchanged,
set old_physpath, drop the g_topology_lock, then block trying to
acquire the SCL_STATE lock. Then another thread can come into
vdev_geom_attrchanged, set old_physpath to the same value, and
proceed to free it. When the first thread resumes, it will free
the same location.

It turns out that the SCL_STATE lock isn't needed. It was
originally added by gibbs to protect vd->vdev_physpath while
updating the same. However, the update process subsequently was
switched to an atomic operation (a pointer swap). Now, there is
no need for the SCL_STATE lock, and hence no need to drop the
g_topology_lock.

Reviewed by: delphij
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5413


297848 12-Apr-2016 avg

l2arc: make sure that all writes honor ashift of a cache device

Previously uncompressed buffers did not obey that rule.

Type of b_asize is changed to uint64_t for consistency,
given that this is a zeta-byte filesystem.

l2arc_compress_buf is renamed to l2arc_transform_buf to better reflect
its new utility. Now not only we ensure that a compressed buffer has
a size aligned to ashift, but we also allocate a properly sized
temporary buffer if the original buffer is not compressed and it has
an odd size. This ensures that all I/O to the cache device is always
ashift-aligned, in terms of both a request offset and a request size.

If the aligned data is larger than the original data, then we have to use
a temporary buffer when reading it as well.

Also, enhance physical zio alignment checks using vdev_logical_ashift.
On FreeBSD we have this information, so we can make stricter assertions.

Reviewed by: smh, mav
MFC after: 1 month
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2789


297847 12-Apr-2016 avg

Revert r297396 Modify "4958 zdb trips assert on pools with ashift >= 0xe"

A better fix is following.


297832 11-Apr-2016 mav

MFV r297831: 6322 ZFS indirect block predictive prefetch

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Author: Alexander Motin <mav@FreeBSD.org>

Improve speculative prefetch of indirect blocks.

Scalability of many operations on wide ZFS pool can be limited by
requirement to prefetch indirect blocks first. Recently added
asynchronous indirect block read partially helped, but did not
solve the problem completely. This patch extends existing prefetcher
functionality to explicitly work with indirect blocks.

Before this change prefetcher issued reads for up to 8MB of data in
advance. With this change it also issues indirect block reads
for up to 64MB of data in advance, so that when it will be time to
actually read those data, it can be done immediately. Alike effect
can be achieved by just increasing maximal data prefetch distance,
but at higher memory cost.

Also this change introduces indirect block prefetch for rewrite
operations, that was never done before. Previously ARC miss for
Indirect blocks regularly blocked rewrites, converting perfectly
aligned asynchronous operations into synchronous read-write pairs,
significantly reducing maximal rewrite speed.

While being there this issue was also fixed:
- prefetch was done always, even if caching for the dataset was
completely disabled.

Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool
of 12 assorted HDDs shown me such performance numbers:
------- BEFORE --------
Write 491363677 bytes/sec
Read 312430631 bytes/sec
Rewrite 97680464 bytes/sec
-------- AFTER --------
Write 493524146 bytes/sec
Read 438598079 bytes/sec
Rewrite 277506044 bytes/sec

Closes #65
Closes #80

openzfs/openzfs@792fd28ac04f78cc5e43ead2d72a96f244ea84e8


297819 11-Apr-2016 smh

Only include sysctl in kernel build

Only include sysctl in kernel builds fixing warning about implicit
declaration of function 'sysctl_handle_int'.

PR: 204140
MFC after: 1 week
X-MFC-With: r297813
Sponsored by: Multiplay


297813 11-Apr-2016 smh

Only include sysctl in kernel build

Only include sysctl in kernel builds fixing warning about implicit
declaration of function 'sysctl_handle_int'.

Sponsored by: Multiplay


297812 11-Apr-2016 avg

zio: align use of "no dump" flag between use_uma and !use_uma cases

At the moment no ZFS buffers are included into a crash dump unless
ZFS_DEBUG (or INVARIANTS) kernel option is enabled. That's not very
helpful for debugging of ZFS problems, because important information
often resides in metadata buffers.
This change switches the dumping behavior when UMA is used from the
illumos behavior to a more useful behavior that we have on FreeBSD
when ZFS buffers are allocated via malloc.

Reviewed by: smh, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D5892


297773 10-Apr-2016 markj

Implement support for boot-time DTrace.

This allows one to enable DTrace probes relatively early during boot,
during SI_SUB_DTRACE_ANON, before dtrace(1) can invoked. The desired
enabling is created using dtrace -A, which writes a /boot/dtrace.dof
file and uses nextboot(8) to ensure that DTrace kernel modules are loaded
and that the DOF file describing the enabling is loaded by loader(8)
during the subsequent boot. The trace output can then be fetched with
dtrace -a.

With this commit, boot-time DTrace is only functional on i386 and amd64: on
other architectures, the high-resolution timer frequency is initialized
during SI_SUB_CLOCKS and is thus not available when the anonymous
tracing state is initialized. On x86, the TSC is used and is thus available
earlier.

MFC after: 1 month
Relnotes: yes


297763 09-Apr-2016 mav

MFV r297760: 6418 zpool should have a label clearing command

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Author: Will Andrews <will@firepipe.net>

Closes #83
Closes #32

openzfs/openzfs@9663688425131744221ea99f9e66b9ed964492ae

FreeBSD already had `zpool labelclear` functionality, so this is mostly
just a diff reduction.

MFC after: 1 month


297709 08-Apr-2016 avg

zio write issue threads should have lower (numerically greater) priority

This is because they might do data compression which is quite CPU
expensive. The original code is correct for illumos, because there
a higher priority corresponds to a greater number.

MFC after: 2 weeks


297672 07-Apr-2016 mav

Alike to r293708 relax pool check in vdev_geom_open_by_path().

This made impossible spare disk open by known path, which kind of worked
only because the same fix was applied to vdev_geom_attach_by_guids() in
r293708.

MFC after: 1 week


297633 07-Apr-2016 trasz

Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080


297513 02-Apr-2016 avg

remove emulation of VFS_HOLD and VFS_RELE from opensolaris compat

On FreeBSD VFS_HOLD/VN_RELE were mapped to MNT_REF/MNT_REL that
manipulate mnt_ref. But the job of properly maintaining the reference
count is already automatically performed by insmntque(9) and
delmntque(9). So, in effect all ZFS vnodes referenced the corresponding
mountpoint twice.

That was completely harmless, but we want to be very explicit about what
FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF
provide quite different guarantees with respect to the held vfs_t /
mountpoint. On illumos VFS_HOLD is sufficient to guarantee that
vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does
*not* provide the same guarantee about mnt_data. We have to use
vfs_busy() to get that guarantee.

Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed.
VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers.

And because vfs_busy has a richer interface that can not be dumbed down
in all cases it's better to explicitly use it rather than trying to mask
it behind VFS_HOLD.

This change fixes a panic that could result from a race between
zfs_umount() and zfs_ioc_rollback(). We observed a case where
zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still
using. That happened because there was nothing to prevent unmounting of
a ZFS filesystem that was in between zfs_suspend_fs() and
zfs_resume_fs().

Reviewed by: kib, smh
MFC after: 3 weeks
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2794


297509 02-Apr-2016 mav

MFV r297506: 6738 zfs send stream padding needs documentation

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>

illumos/illumos-gate@c20404ff77119516354b0d112d28b7ea0dadd303


297507 02-Apr-2016 mav

MFV r297504: 6681 zfs list burning lots of time in dodefault() via dsl_prop_*

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alex Wilson <alex.wilson@joyent.com>

illumos/illumos-gate@d09e4475f635b6f66ee68d8c17a32bba7be17c96


297473 31-Mar-2016 glebius

Fix an error in r292373. Use proper count to update "pages in" counter.

Noticed by: pfg via Coverity


297421 30-Mar-2016 mav

Plug open count leak on zvol rename.

MFC after: 2 weeks


297420 30-Mar-2016 mav

Switch from using make_dev_p() to make_dev_s() to close races.


297396 29-Mar-2016 mav

Modify "4958 zdb trips assert on pools with ashift >= 0xe".

Unlike Illumos FreeBSD has concept of logical ashift, that specifies
really minimal vdev block size that can be accessed. This knowledge
allows properly pad physical I/O and correctly assert its alignment.

This change fixes L2ARC write errors when device has logical sector
size above 512 bytes.

MFC after: 1 month


297337 28-Mar-2016 mav

Pass through error code from make_dev_p().

ENAMETOOLONG is much more informative in logs then ENXIO.

MFC after: 1 week


297232 24-Mar-2016 mav

Unify ignoring EEXIST from zvol_create_minor().

This fixes creation of zvol devices for snapshots during zfs receive,
that previously failed with "ZFS WARNING: Unable to create ZVOL" message.
This solution is not perfect, but IMHO better then it was before.

MFC after: 2 weeks


296990 17-Mar-2016 markj

Remove unused variables dtrace_in_probe and dtrace_in_probe_addr.


296615 10-Mar-2016 mav

Make ZFS ignore stripe sizes above SPA_MAXASHIFT (8KB).

If device has stripe size bigger then maximal sector size supported by
ZFS, there is nothing can be done to avoid read-modify-write cycles.
Taking that stripe size into account will only reduce space efficiency
and pointlessly bother user with warnings that can not be fixed.

Discussed with: smh


296613 10-Mar-2016 mav

Make ZFS more picky to GEOM stripe sizes and offsets.

Use of misaligned or non-power-of-2 stripes is not really useful for ZFS,
since increased ashift won't help to avoid read-modify-write cycles, and
only reduce pool space efficiency and compression rates.


296610 10-Mar-2016 mav

MFV r296609: 6370 ZFS send fails to transmit some holes

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

In certain circumstances, "zfs send -i" (incremental send) can produce a
stream which will result in incorrect sparse file contents on the
target.

The problem manifests as regions of the received file that should be
sparse (and read a zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).

Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).

Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).

We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).

The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).

Closes #37

openzfs/openzfs@adef853162e83f7cdf6b2d9af9756d434a9c743b


296563 09-Mar-2016 mav

Add new IOCTL compat shims for ABI breakage caused by r296510:
MFV r296505: 6531 Provide mechanism to artificially limit disk performance


296530 08-Mar-2016 mav

MFV r296529:
6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt()
6673 want a macro to convert seconds to nanoseconds and vice-versa

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>

illumos/illumos-gate@a8f6344fa0921599e1f4511e41b5f9a25c38c0f9


296528 08-Mar-2016 mav

MFV r296527: 6659 nvlist_free(NULL) is a no-op

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>

illumos/illumos-gate@aab83bb83be7342f6cfccaed8d5fe0b2f404855d


296523 08-Mar-2016 mav

MFV r296522: 6541 Pool feature-flag check defeated if "verify" is included
in the dedup property value

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>

illumos/illumos-gate@971640e6aa954c91b0706543741aa4570299f4d7


296521 08-Mar-2016 mav

MFV r296520: 6562 Refquota on receive doesn't account for overage

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@5f7a8e6d750cb070a3347f045201c6206caee6aa


296519 08-Mar-2016 mav

MFV r296518: 5027 zfs large block support (add copyright)

Author: Matthew Ahrens <matt@mahrens.org>

illumos/illumos-gate@c3d26abc9ee97b4f60233556aadeb57e0bd30bb9


296516 08-Mar-2016 mav

MFV r296515: 6536 zfs send: want a way to disable setting of
DRR_FLAG_FREERECORDS

Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andrew Stormont <astormont@racktopsystems.com>

illumos/illumos-gate@880094b6062aebeec8eda6a8651757611c83b13e


296514 08-Mar-2016 mav

MFV r296513: 6450 scrub/resilver unnecessarily traverses snapshots created
after the scrub started

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@38d61036746e2273cc18f6698392e1e29f87d1bf


296512 08-Mar-2016 mav

MFV r296511: 6537 Panic on zpool scrub with DEBUG kernel

Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Gary Mills <gary_mills@fastmail.fm>

illumos/illumos-gate@8c04a1fa3f7d569d48fe9b5342d0bd4c533179b9


296510 08-Mar-2016 mav

MFV r296505: 6531 Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@97e81309571898df9fdd94aab1216dfcf23e060b


296479 08-Mar-2016 markj

Fix fasttrap tracepoint locking.

Upstream, tracepoints are protected by per-CPU mutexes. An unlinked
tracepoint may be freed once all the tracepoint mutexes have been acquired
and released - this is done in fasttrap_mod_barrier(). This mechanism was
not properly ported: in some places, the proc lock is used in place of a
tracepoint lock, and in others the locking is omitted entirely. This change
implements tracepoint locking with an rmlock, where the read lock is used
in fasttrap probe context. As a side effect, this fixes a recursion on the
proc lock when the raise action is used from a userland probe.

MFC after: 1 month


296477 08-Mar-2016 markj

Remove the fasttrap implementation for sparc.

Other machine-dependent code required for DTrace on sparc is not present in
the tree, so there's no point to keeping the fasttrap code.


296475 08-Mar-2016 markj

MFV r296306: 6604 harden DIF bounds checking

Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@1c0cef67dba05c477dba779bc99224693e809a14

MFC after: 2 weeks


296021 25-Feb-2016 smh

Removed unused label and fix mutex_exit order

Remove unused done label from zfs_setacl fixing PVS-Studio V729.

Fix mutex_exit order to mirror the mutex_enter order.

MFC after: 1 week
Sponsored by: Multiplay


295707 17-Feb-2016 imp

Create an API to reset a struct bio (g_reset_bio). This is mandatory
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.


295125 01-Feb-2016 avg

MFV r294821: 6529 Properly handle updates of variably-sized SA entries.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@icyb.net.ua>

illumos/illumos-gate@e7e978b1f75353cb29673af9b35453c20c2827bf

During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.

Another problem was that index into attr_desc[] was increased even when
an attribute was removed. If that attribute was not the last attribute,
then the last attribute was lost.


295045 29-Jan-2016 asomers

Add a sysctl to allow ZFS pools backed by zvols

Change 294329 removed the ability to build ZFS pools that are backed by
zvols, because having that ability (even if it's not used) leads to
deadlocks. By popular demand, I'm adding an off-by-default sysctl to
reenable that ability.

Reviewed by: lidl, delphij
MFC after: Never
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4998


295041 29-Jan-2016 br

Welcome the RISC-V 64-bit kernel.

This is the final step required allowing to compile and to run RISC-V
kernel and userland from HEAD.

RISC-V is a completely open ISA that is freely available to academia
and industry.

Thanks to all the people involved! Special thanks to Andrew Turner,
David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and
Arun Thomas for their help.
Thanks to Robert Watson for organizing this project.

This project sponsored by UK Higher Education Innovation Fund (HEIF5) and
DARPA CTSRD project at the University of Cambridge Computer Laboratory.

FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv

Reviewed by: andrew, emaste, kib
Relnotes: Yes
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D4982


294820 26-Jan-2016 mav

MFV r294819: 6495 Fix mutex leak in dmu_objset_find_dp

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Steven Hartland <steven.hartland@multiplay.co.uk>

illumos/illumos-gate@2bad22584defe4667f99737e3158d336e4dcca11


294817 26-Jan-2016 mav

MFV r294816: 4986 receiving replication stream fails if any snapshot
exceeds refquota

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@5878fad70d76d8711f6608c1f80b0447601261c6


294815 26-Jan-2016 mav

MFV r294814: 6393 zfs receive a full send as a clone

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>

illumos/illumos-gate@68ecb2ec930c4b0f00acaf8e0abb2b19c4b8b76f

This allows to do a full (non-incremental send) and receive it as a clone
of an existing dataset. It can leverage nopwrite to share blocks with the
origin. This can be used to change the relationship of datasets on the
target. For example, maybe on the source you have:

A ---- B ---- C

And you have sent to the target a full of B, and the incremental B->C:

B ---- C

You later realize that you want to have A on the target. You will have to
do a full send of A, but nopwrite can save you space on the target if you
receive it as a clone of B, assuming that A and B have some blocks inxi
common:

B ---- C
\
A


294813 26-Jan-2016 mav

MFV r294812: 6434 sa_find_sizes() may compute wrong SA header size

Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: James Pan <jiaming.pan@yahoo.com>

illumos/illumos-gate@3502ed6e7cb3f3d2e781960ab8fe465fdc884834


294811 26-Jan-2016 mav

MFV r294810: 6414 vdev_config_sync could be simpler

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Will Andrews <will@firepipe.net>

illumos/illumos-gate@eb5bb58421f46cee79155a55688e6c675e7dd361


294809 26-Jan-2016 mav

MFV r294808: 6421 Add missing multilist_destroy calls to arc_fini

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@57deb2328260c447bf1db25fe74e0eece102733e


294807 26-Jan-2016 mav

MFV r294806: 6388 Failure of userland copy should return EFAULT

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>

illumos/illumos-gate@c71c00bbe8a9cdc7e3f4048b751f48e80441d506


294805 26-Jan-2016 mav

MFV r294804: 6386 Fix function call with uninitialized value in vdev_inuse

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>

illumos/illumos-gate@5bdd995ddb777f538bfbcc5e2d5ff1bed07ae56e


294803 26-Jan-2016 mav

MFV r294802: 6334 Cannot unlink files when over quota

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>

illumos/illumos-gate@6575bca01367958c7237253d88e5fa9ef0b1650a


294801 26-Jan-2016 mav

MFV r294800: 6385 Fix unlocking order in zfs_zget

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>

illumos/illumos-gate@eaef6a96de3f6afbbccc69bd7a0aed4463689d0a


294799 26-Jan-2016 mav

MFV r294798:
6292 exporting a pool while an async destroy is running can leave entries
in the deferred tree

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

illumos/illumos-gate@a443cc80c742af740aa82130db840f02b4389365


294797 26-Jan-2016 mav

MFV r294796: 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg

Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@b39b744be78c6327db43c1f69d11c2f5909f73cb

This is revert of 5693.


294794 26-Jan-2016 mav

MFV r294793:
6367 spa_config_tryenter incorrectly handles the multiple-lock case

Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Approved by: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@e495b6e6735b803e422025a630352ef9bba788c5


294625 23-Jan-2016 trasz

Fix ru_oublocks accounting for ZFS. There are two code paths that can be
called from zfs_write() - one of them, through dmu_write(), was handled
correctly; the other wasn't.

Reviewed by: avg@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D4923


294358 19-Jan-2016 asomers

Quell harmless CID about unchecked return value in nvlist_get_guids.

The return value doesn't need to be checked, because nvlist_get_guid's
callers check the returned values of the guids.

Coverity CID: 1341869
MFC after: 1 week
X-MFC-With: 292066
Sponsored by: Spectra Logic Corp


294329 19-Jan-2016 asomers

Disallow zvol-backed ZFS pools

Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.

Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.

Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.

Reviewed by: delphij
MFC after: 1 week
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4986


294102 15-Jan-2016 dim

MFV r294101: 6527 Possible access beyond end of string in zpool comment

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>

illumos/illumos-gate@2bd7a8d078223b122d65fea49bb8641f858b1409

This fixes erroneous double increments of the 'check' variable in a loop
in spa_prop_validate(). I ran into this in the clang380-import branch,
where clang 3.8.0 warns about it. (It is already fixed there.)

MFC after: 3 days


294027 14-Jan-2016 asomers

Fix race condition involving ZFS remove events

When a ZFS drive disappears, ZFS sends a resource.fs.zfs.removed event to
userland. A userland program like zfsd(8) can use that event, for example to
activate a hotspare. The current code contains a race condition: vdev_geom
will sent the sysevent _before_ spa.c would update the vdev's status,
causing userland processes to see pool state that does not reflect the
device removal. This change moves the sysevent to spa.c, closing the race.

Reviewed by: delphij, Sean Eric Fagan
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4902


293708 11-Jan-2016 asomers

Fix importing l2arc device by guid

After r292066, vdev_geom verifies both the vdev and pool guids of device
labels during open. However, spare and l2arc devices don't have pool guids,
so opening them by guid will fail (opening by path, when the pathname is
known, still succeeds). This change allows a vdev to be opened by guid if
the label contains no pool_guid, which is the case for inactive spares and
l2arc devices.

PR: 292066
Reported by: delphij
Reviewed by: delphij, smh
MFC after: 2 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4861


293677 11-Jan-2016 asomers

Record physical path information in ZFS Vdevs

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
If available, record the physical path of a vdev in ZFS meta-data.
Do this both when opening the vdev, and when receiving an attribute
change notification from GEOM.

Make vdev_geom_close() synchronous instead of deferring its work to
a GEOM event handler. There is no benefit to deferring the work and
this prevents a future open call from referencing a consumer that is
scheduled for destruction. The close followed by an immediate open
will occur during a vdev reprobe triggered by any type of I/O error.

Consolidate vdev_geom_close() and vdev_geom_detach() into
vdev_geom_close() and vdev_geom_close_locked(). This also moves the
cross linking operations between vdev and GEOM consumer into a
single place (linking in vdev_geom_attach() and unlinking in
vdev_geom_close_locked()).

Submitted by: gibbs, asomers
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4524


292782 27-Dec-2015 allanjude

Replace sys/crypto/sha2/sha2.c with lib/libmd/sha512c.c

cperciva's libmd implementation is 5-30% faster

The same was done for SHA256 previously in r263218

cperciva's implementation was lacking SHA-384 which I implemented, validated against OpenSSL and the NIST documentation

Extend sbin/md5 to create sha384(1)

Chase dependancies on sys/crypto/sha2/sha2.{c,h} and replace them with sha512{c.c,.h}

Reviewed by: cperciva, des, delphij
Approved by: secteam, bapt (mentor)
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D3929


292386 16-Dec-2015 glebius

Fix breakage caused by r292373 in ZFS/FUSE/NFS/SMBFS.

With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.

While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.

Pointy hat to: glebius


292373 16-Dec-2015 glebius

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


292069 11-Dec-2015 asomers

Change an important error message from ZFS_LOG to printf

Submitted by: gibbs
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp


292066 10-Dec-2015 asomers

During vdev_geom_open, require that the vdev guids match the device's label
except during split, add, or create operations. This fixes a bug where the
wrong disk could be returned, and higher layers of ZFS would immediately
eject it again.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
o When opening by GUID, require both the pool and vdev GUIDs to
match. While it is highly unlikely for two vdevs to have the same
vdev GUIDs, the ZFS storage pool allocator only guarantees they
are unique within a pool.

o Modify the open behavior to:
- If we are opening a vdev that hasn't previously been opened,
open by path without checking GUIDs.
- Otherwise, open by path and verify GUIDs.
- If that fails, search all geom providers for a device with
matching GUIDs.
- If that fails, return ENOENT.

Submitted by: gibbs, asomers
Reviewed by: smh
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4486


291963 07-Dec-2015 markj

MFV r289003:
6271 dtrace caused excessive fork time

Author: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Gordon Ross <gwr@nexenta.com>

illumos/illumos-gate@7bd3c1d12d0c764e1517c3aca62c634409356764


291962 07-Dec-2015 markj

Modify DTRACEHIOC_ADDDOF to copy the DOF section from the target process.

r281257 added support for lazyload mode by allowing dtrace(1) to register
a DOF section on behalf of a traced process. This was implemented by
having libdtrace copy the DOF section into a heap-allocated buffer and
passing its address to the ioctl handler. However, DTrace uses the DOF
section address as a lookup key in certain cases, so the ioctl handler
should be given the target process' DOF section address instead. This
change modifies the ADDDOF handler to copy the DOF section in from the
target process, rather than from dtrace(1).


291961 07-Dec-2015 markj

Add helper functions proc_readmem() and proc_writemem().

These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().

This change also adds a manual page for proc_rwmem() and the new functions.

Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D4245


291545 01-Dec-2015 stas

Make the number of fasttrap probes and the size of the trace points hash table
tunable via sysctl or kernel tunables.

Illumos allows this parameters to be changed via the fasttrap.conf configuration
file, but FreeBSD code hardcoded the parameters. Expose them under
the kern.dtrace.fasttrap sysctl tree.

MFC after: 2 weeks


290466 06-Nov-2015 smh

Switch zfs_panic_recover to panic for bad DVA

As reported by Coverity a null pointer de-reference panic would be triggered
when zfs_recover was set so switch to straight panic as it can never be
recovered.

Reported by: Coverity Scan
MFC after: 1
X-MFC-With: r290401
Sponsored by: Multiplay


290401 05-Nov-2015 smh

Provide information about bad DVA

Provide information about which vdev has an issue with a bad DVA.

MFC after: 1 week
Sponsored by: Multiplay


290399 05-Nov-2015 smh

Allow zfs_recover to be changed at runtime

MFC after: 1 week
Sponsored by: Multiplay


290266 02-Nov-2015 avg

zfs: allow the lookup of extended attributes of an unlinked file

That's required for extattr_get_fd(2) and the like to work properly.

PR: 203201
MFC after: 17 days


290191 30-Oct-2015 avg

l2arc: do not call trim_map_free() for blocks with zero b_asize

b_asize can be zero if the block is compressed into an empty block
(ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
zero-sized trimming is not attempted.
The logic for calling trim_map_free() is extracted into a new function
l2arc_trim() to minimize code duplication.

PR: 203473
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 11 days


289562 19-Oct-2015 mav

MFV r289561: 6328 Fix cstyle errors in zfs codebase

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

illumos/illumos-gate@9a686fbc186e8e2a64e9a5094d44c7d6fa0ea167


289527 18-Oct-2015 mav

MFV r289526:
5561 support root pools on EFI/GPT partitioned disks
5125 update zpool/libzfs to manage bootable whole disk pools (EFI/GPT labeled disks)

Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>

illumos/illumos-gate@1a902ef8628b0dffd6df5442354ab59bb8530962

This is NOP changes for FreeBSD.


289422 16-Oct-2015 mav

MFV r289310:
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@45818ee124adeaaf947698996b4f4c722afc6d1f

This is only a partial merge of respective ZFS infrastructure changes.
At this moment FreeBSD kernel has no those crypto algorithms, so the
parts of the code to enable them are commented out. When they are
implemented, it will be trivial to plug them in.


289362 15-Oct-2015 mav

MFV r289312: 2605 want to resume interrupted zfs send

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@9c3fd1216fa7fb02cfbc78a2518a686d54b48ab8

For more info, see:
- slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive
- video https://www.youtube.com/watch?v=iY44jPMvxog
- manpage changes (for zfs resume -s and zfs send -t)
- upcoming talk at the OpenZFS Developer Summit

The TL;DR is:
Use "zfs receive -s" to save the partially received state on failure.
On failure, get the receive token with "zfs get receive_resume_token <fs>"
Resume the send with "zfs send -t <token_value>"

Relnotes: yes


289309 14-Oct-2015 mav

MFV r289308: 6267 dn_bonus evicted too early

Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Justin T. Gibbs <gibbs@FreeBSD.org>

illumos/illumos-gate@d2058105c61ec61df3a2dd3f839fed8c3fe7bfd6


289307 14-Oct-2015 mav

MFV r289306: 6295 metaslab_condense's dbgmsg should include vdev id

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Joe Stein <joe.stein@delphix.com>

illumos/illumos-gate@daec38ecb4fb5e73e4ca9e99be84f6b8c50c02fa


289305 14-Oct-2015 mav

MFV r289304: 6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@8fe00bfb8790ad51653f67b01d5ac14256cbb404


289299 14-Oct-2015 mav

MFV r289298: 6286 ZFS internal error when set large block on bootfs

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@6de9bb5603e65b16816b7ab29e39bac820e2da2b


289297 14-Oct-2015 mav

MFV r289296: 6288 dmu_buf_will_dirty could be faster

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@0f2e7d03b8f588387cb8dd8dd500cbe5ff4484e0


289295 14-Oct-2015 mav

MFV r289294: 5219 l2arc_write_buffers() may write beyond target_sz

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Justin Gibbs <gibbs@FreeBSD.org>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@freebsd.org>

illumos/illumos-gate@d7d9a6d919f92d74ea0510a53f8441396048e800


289194 12-Oct-2015 mav

FreeBSD-specific addition to r289191.


289192 12-Oct-2015 mav

MFV r289188: 6281 prefetching should apply to 1MB reads

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Alexander Motin <mav@freebsd.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@632802744ef6d17e06d6980a95f631615c3b060f


289191 12-Oct-2015 mav

MFV r289187: 6251 add tunable to disable free_bpobj processing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@139510fb6efa97dbe5f5479594b308d940cab8d1


289190 12-Oct-2015 mav

MFV r289185: 6250 zvol_dump_init() can hold txg open

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@b10bba72460aeaa53119c76ff5e647fd5585bece


288579 03-Oct-2015 mav

Restore original array_rd_sz semantics.

Before r278702 prefetch was blocked for I/Os > 1MB, after -- >= 1MB.
1MB I/Os are used for bulk operations in CTL (XCOPY, VERIFY), and disabling
prefetch for them reduced the performance.

This is temporary local patch, that should be replaced when upstreamed.

Discussed with: mahrens
MFC after: 3 days


288415 30-Sep-2015 markj

MFV r288408:
6266 harden dtrace_difo_chunksize() with respect to malicious DIF

illumos/illumos-gate@395c7a3dcfc66b8b671dc4b3c4a2f0ca26449922

Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Bryan Cantrill <bryan@joyent.com>

MFC after: 1 week


288204 25-Sep-2015 delphij

MFV r288063: make dataset property de-registration operation O(1)

A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of the live dataset. A property
change is effected by doing a traversal of the tree of dsl directories
for a pool, starting at the directory sourcing the change, and invoking
these callbacks.

The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration for a property is
O(1) (insert into a list), de-registration is O(n) (search list and then
remove). The 'n' for de-registration, however, is not limited to the size
(number of snapshots + 1) of the dsl directory. The eviction portion
of the life cycle for the in core state of datasets is asynchronous,
which allows multiple copies of the dataset information to be in-core
at once. Only one of these copies is active at any time with the rest
going through tear down processing, but all copies contribute to the
cost of performing a dsl_prop_unregister().

One way to create multiple, in-flight copies of dataset information
is by performing "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and then
evicted when reference counts drops to zero. For datasets that are not
mounted, there is no persistent reference count to keep them resident.
So, a list operation will load them, compute the information required to
do the list operation, and then evict them. When performing this operation
from multiple threads it is possible that some of the in-core dataset
information will be reused, but also possible to lose the race and load
the dataset again, even while the same information is being torn down.

Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.

The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h:
Associate dsl property callback records with both the
dsl directory and the dsl dataset that is registering the
callback. Both connections are protected by the dsl directory's
"dd_lock".

When linking callbacks into a dsl directory, group them by
the property type. This helps reduce the space penalty for the
double association (the property name pointer is stored once
per dsl_dir instead of in each record) and reduces the number of
strcmp() calls required to do callback processing when updating
a single property. Property types are stored in a linked list
since currently ZFS registers a maximum of 10 property types
for each dataset.

Note that the property buckets/records associated with a dsl
directory are created on demand, but only freed when the dsl
directory is freed. Given the static nature of property types
and their small number, there is no benefit to freeing the few
bytes of memory used to represent the property record earlier.
When a property record becomes empty, the dsl directory is either
going to become unreferenced a little later in this thread of
execution, or there is a high chance that another dataset is
going to be loaded that would recreate the bucket anyway.

Replace dsl_prop_unregister() with dsl_prop_unregister_all().
All callers of dsl_prop_unregister() are trying to remove
all property registrations for a given dsl dataset anyway. By
changing the API, we can avoid doing any lookups of callbacks
by property type and just traverse the list of all callbacks
for the dataset and free each one.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
Replace use of dsl_prop_unregister() with the new
dsl_prop_unregister_all() API.

illumos/illumos-gate@03bad06fbb261fd4a7151a70dfeff2f5041cce1f
Author: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Illumos issue:
6171 dsl_prop_unregister() slows down dataset eviction
https://www.illumos.org/issues/6171

MFC after: 2 weeks


288064 21-Sep-2015 avg

MFV r287817: 6220 memleak in l2arc on debug build

https://github.com/illumos/illumos-gate/commit/c546f36aa898d913ff77674fb5ff97f15b2e08b4
https://www.illumos.org/issues/6220
5408 introduced a memleak in l2arc, namely the member b_thawed gets leaked when
an arc_hdr is realloced from full to l2only.

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <sensille@gmx.net>


287745 13-Sep-2015 delphij

MFV r287623: 5997 FRU field not set during pool creation and never
updated

ZFS already supports storing the vdev FRU in a vdev property. There
is code in libzfs to work with this property, and there is code in
the zfs-retire FMA module that looks for that information. But there
is no code actually setting or updating the FRU.

To address this, ZFS is changed to send a handful of new events
whenever a vdev is added, attached, cleared, or onlined, as well
as when a pool is created or imported.

Note that syseventd is not currently available on FreeBSD and thus
some work is needed to actually support the new ZFS events (e.g. in
zfsd) to actually use this capability, this changeset is mostly a
diff reduction from upstream.

illumos/illumos-gate@1437283407f89cab03860accf49408f94559bc34

Illumos issues:

5997 FRU field not set during pool creation and never updated
https://www.illumos.org/issues/5997


287744 13-Sep-2015 delphij

Note r286552 as merged and reduce diff against upstream.


287706 12-Sep-2015 delphij

MFV r287699: 6214 zpools going south

In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced. This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.

Prevent this by reintroducing b_compress

illumos/illumos-gate@d4cd038c92c36fd0ae35945831a8fc2975b5272c

Illumos issues:

6214 zpools going south
https://www.illumos.org/issues/6214


287702 12-Sep-2015 delphij

MFV r287624: 5987 zfs prefetch code needs work

Rewrite the ZFS prefetch code to detect only forward, sequential
streams.

The following kstats have been added:

kstat.zfs.misc.arcstats.sync_wait_for_async

How many sync reads have waited for async read
to complete. (less is better)

kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch

How many demand read didn't have to wait for I/O
because of predictive prefetch. (more is better)

zfetch kstats have been similified to hits, misses, and max_streams,
with max_streams representing times when we were not able to create
new stream because we already have the maximum number of sequences
for a file.

The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been
replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes
to prefetch per stream.

illumos/illumos-gate@cf6106c8a0d6598b045811f9650d66e07eb332af

Illumos ZFS issues:

5987 zfs prefetch code needs work
https://www.illumos.org/issues/5987


287642 11-Sep-2015 markj

MFV r283513:
5930 fasttrap_pid_enable() panics when prfind() fails in forking process
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@9df7e4e12eb093557252d3bec029b5c382613e36


287641 11-Sep-2015 markj

MFV r283512:
3599 dtrace_dynvar tail calls can blow stack
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@d47448f09aae3aa1a87fc450a0c44638e7ce7b51


287445 04-Sep-2015 delphij

Expose an interface to determine if an ACE is inherited.

Submitted by: sef
Reviewed by: trasz
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3540


287337 31-Aug-2015 allanjude

Apply the noline attribute to vdev_queue_max_async_writes

This makes it possible to analyze the performance of the new ZFS
write throttle with dtrace

PR: 200316
Submitted by: Lacey Powers <lacey.leanne@gmail.com>
Reviewed by: avg, smh, delphij (no objection)
Approved by: bapt (mentor)
MFC after: 1 month
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D3472


287283 29-Aug-2015 delphij

Fix a buffer overrun which may lead to data corruption, introduced in
r286951 by reinstating changes in r274628.

In l2arc_compress_buf(), we allocate a buffer to stash away the compressed
data in 'cdata', allocated of l2hdr->b_asize bytes.

We then ask zio_compress_data() to compress the buffer, b_l1hdr.b_tmp_cdata,
which is of l2hdr->b_asize bytes, and have the compressed size (or original
size, if compress didn't gain enough) stored in csize.

To pad the buffer to fit the optimal write size, we round up the compressed
size to L2 device's vdev_ashift.

Illumos code rounds up the size by at most SPA_MINBLOCKSIZE. Because we
know csize <= b_asize, and b_asize is integer multiple of SPA_MINBLOCKSIZE,
we are guaranteed that the rounded up csize would be <= b_asize. However,
this is not necessarily true when we round up to 1 << vdev_ashift, because
it could be larger than SPA_MINBLOCKSIZE.

So, in the worst case scenario, we are overwriting at most

(1 << vdev_ashift - SPA_MINBLOCKSIZE)

bytes of memory next to the compressed data buffer.

Andriy's original change in r274628 reorganized the code a little bit,
by moving the padding to after we determined that the compression was
beneficial. At which point, we would check rounded size against the
allocated buffer size, and the buffer overrun would not be possible.


287280 29-Aug-2015 delphij

In r286705 (Illumos 5960/a2cdcdd), a separate thread is created with curproc
as parent. In the case of a send or receive, the curproc would be the
userland application that issues the ioctl. This would trigger an assertion
failure introduced in Solaris compatibility shims in r196458 when kernel is
compiled with INVARIANTS.

Fix this by using p0 (proc0 or kernel) as the parent thread when creating
the kernel threads.


287103 24-Aug-2015 avg

MFV (partial) r286889: 5692 expose the number of hole blocks in a file

FreeBSD porting notes:
- only kernel-side changes are merged
- the new ioctl is not actually implemented yet
- thus, the goal is to synchronize DMU code

illumos/illumos-gate@2bcf0248e992f292c7b814458bcdce2f004925d6

https://www.illumos.org/issues/5692
we would like to expose the number of hole (sparse) blocks in a file.
this can be useful to for example if you want to fill in the holes with
some data; knowing the number of holes in advances allows you to report
progress on hole filling. We could use SEEK_HOLE to do that but it would
be O(n) where n is the number of holes present in the file.

Author: Max Grossman <max.grossman@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>


287100 24-Aug-2015 avg

spa_import_rootpool: prevent lock and resource leak

The lock leak could lead to a deadlock later.

PR: 198563
Submitted by: Fabian Keil <fk@fabiankeil.de>
MFC after: 1 week


287099 24-Aug-2015 avg

account for ashift when gathering buffers to be written to l2arc device

The change that introduced the L2ARC compression support also introduced
a bug where the on-disk size of the selected buffers could end up larger
than the target size if the ashift is greater than 9. This was because
the buffer selection could did not take into account the fact that
on-disk size could be larger than the in-memory buffer size due to
the alignment requirements.

At the moment b_asize is a misnomer as it does not always represent the
allocated size: if a buffer is compressed, then the compressed size is
properly rounded (on FreeBSD), but if the compression fails or it is not
applied, then the original size is kept and it could be smaller than what
ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem could also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

This change is already in ZFS-on-Linux:
zfsonlinux/zfs@ef56b0780c80ebb0b1e637b8b8c79530a8ab3201

PR: 198242
PR: 195746 (possibly related)
Reviewed by: mahrens (https://reviews.csiden.org/r/229/)
Tested by: gkontos@aicom.gr (most recently)
MFC after: 15 days
X-MFC note: patch does not apply as is at the moment
Relnotes: yes
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2764
Reviewed by: noone (@FreeBSD.org)


286985 21-Aug-2015 avg

try to fix lor between z_teardown_lock and spa_namespace_lock

The lock order reversal and a resulting deadlock were introduced
in r285021 / D2865. The problem is that zfs_register_callbacks() calls
dsl_prop_get_integer() that has to acquire spa_namespace_lock.
At the same time, spa_config_sync() is called with spa_namespace_lock
held and then it performs ZFS vnode operations that acquire
z_teardown_lock in the reader mode.

So, fix the problem by using dsl_prop_get_int_ds() instead of
dsl_prop_get_integer(). The former does not need to look up
the pool and the dataset by name.

Reported by: many
Reviewed by: delphij
Tested by: delphij, Jens Schweikhardt <schweikh@schweikhardt.net>
MFC after: 5 days
X-MFC with: r285021


286983 21-Aug-2015 avg

fix a mismerge in r286539 (MFV 286538: 5562 ZFS sa_handle's violate...)

PR: 202358
X-MFC with: r286539
X-MFC attn: mav


286951 20-Aug-2015 mav

Restore part of r274628, reverted at r286776.

Submitted by: avg


286776 14-Aug-2015 mav

Remove some random accumulated diff from Illumos.

Submitted by: avg (partially)


286774 14-Aug-2015 mav

2618 arc.c mistypes in the comments

Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bart Coddens <bart.coddens@gmail.com>

illumos/illumos-gate@fc98fea58e89224f6f13d7fae246d6cb5dfa35ea


286770 14-Aug-2015 mav

Fix r286766 build with debug.


286767 14-Aug-2015 mav

Fix minor mismerge sometimes earlier.


286766 14-Aug-2015 mav

MFV r286765: 5817 change type of arcs_size from uint64_t to refcount_t

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@2fd872a734cf486007a8dba532cec52bfb4d40e5

As a way to make it more difficult to introduce bugs into the ARC, and to
make it easier to diagnose issues when bugs do creep in, it would be
beneficial to change the type of the arc_state_t's arcs_size field to be
a refcount_t instead of a uint64_t. This would allow us to make stricter
checks when incrementing and decrementing the value with debugging enabled,
but still fallback to simple, fast atomic operations when debugging is
disabled.


286764 14-Aug-2015 mav

MFV r285025: 6033 arc_adjust() should search MFU lists for oldest buffer
when adjusting MFU size.

illumos/illumos-gate@31c46cf23cd1cf4d66390a983dc5072d7d299ba2

https://www.illumos.org/issues/6033
When we're looking for the list containing oldest buffer we never
actually look at the MFU lists even when we try to evict from MFU.
looks like a copy paste error, the fix is here:

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>
Obtained from: illumos


286763 14-Aug-2015 mav

MFV r277431: 5497 lock contention on arcs_mtx

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@244781f10dcd82684fd8163c016540667842f203

This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.


286762 14-Aug-2015 mav

Revert part of r205231, introducing multiple ARC state locks.

This local implementation will be replaced by one from Illumos to reduce
code divergence and make further merges easier.


286712 13-Aug-2015 mav

MFV 286711: 6096 ZFS_SMB_ACL_RENAME needs to cleanup better

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@8f5190a540d64d2debee6664bbc740e4c38f5b7f


286710 13-Aug-2015 mav

MFV 286709:
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@0f92170f1ec2737ee5a0e51b5f74093904811452


286708 13-Aug-2015 mav

MFV 286707: 5959 clean up per-dataset feature count code

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@ca0cc3918a1789fa839194af2a9245f801a06b1a

A ZFS feature flags (large blocks) tracks its refcounts as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.


286705 12-Aug-2015 mav

MFV r286704: 5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().

Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.


286689 12-Aug-2015 mav

MFV r284763: 5981 Deadlock in dmu_objset_find_dp

illumos/illumos-gate@1d3f896f5469c69c1339890ec3d68e9feddb0343

https://www.illumos.org/issues/5981
When dmu_objset_find_dp gets called with a read lock held, it fans out
the work to the task queue. Each task in turn acquires its own read
lock before calling the callback. If during this process anyone tries
to a acquire a write lock, it will stall all read lock requests.Thus
the tasks will never finish, the read lock of the caller will never
get freed and the write lock never acquired. deadlock.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <jansen@webgods.de>


286686 12-Aug-2015 mav

MFV r284762: 5269 zpool import slow

illumos/illumos-gate@12380e1e701fda28c9e9f32d01cafb54af279eb5

https://www.illumos.org/issues/5269
When importing a pool (at boot or with zpool import) with many
filesystem, the process can take minutes. It doesn't matter whether
the pool has been exported cleanly or uncleanly. The problem is that
each dataset has its own log chain. On import, all datasets have to be
checked if there are logs to replay. The idea is to speed up this
process by paralellizing it.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Arne Jansen <jansen@webgods.de>


286683 12-Aug-2015 mav

MFV r286682: 5765 add support for estimating send stream size with
lzc_send_space when source is a bookmark

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
Author: Max Grossman <max.grossman@delphix.com>

illumos/illumos-gate@643da460c8ca583e39ce053081754e24087f84c8


286677 12-Aug-2015 mav

MFV r286224: 5695 dmu_sync'ed holes do not retain birth time

illumos/illumos-gate@70163ac57e58ace1c5c94dfbe85dca5a974eff36

https://www.illumos.org/issues/5695
In dmu_sync_ready(), a hole block pointer will have it's logical size
explicitly set as it's necessary for replay purposes. To "undo" this,
dmu_sync_done() will zero out any hole that it finds. This becomes a
problem when using the "hole_birth" feature, as this will also wipe out
any birth time that might have happened to be set on the hole.
...
As a fix, the logic to zero out a hole is only applied to old style
holes with a birth time of zero. Holes created with the "hole_birth"
feature enabled will have a non-zero birth time, and will be skipped
(thus preserving the ltime, type, and level information as well).
In addition, zdb was updated to also print the ltime, type, and level
information for these new style holes. Previously, only the logical
birth time would be printed.

Author: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>


286655 12-Aug-2015 mav

Fix set of sign extension bugs in r286625.


286647 11-Aug-2015 mav

Fix assertion panic caused by combination of r286598 and TRIM.


286628 11-Aug-2015 mav

Fix r286625 build on i386.


286626 11-Aug-2015 mav

Fix minor mismerge in r286574.


286625 11-Aug-2015 mav

MFV r277425:
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@2ec99e3e987d8aa273f1e9ba2b983557d058198c


286623 11-Aug-2015 mav

Remove extra lock, that IMO only creates potential problems now.


286605 10-Aug-2015 mav

MFV 286604: 5812 assertion failed in zrl_tryenter(): zr_owner==NULL

Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@8df173054ca442cd8845a7364c3edad9d6822351


286603 10-Aug-2015 mav

MFV 286602: 5810 zdb should print details of bpobj

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@732885fca09e11183dd0ea69aaaab5588fb7dbff


286600 10-Aug-2015 mav

MFV 286599: 5808 spa_check_logs is not necessary on readonly pools

Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@23367a2f2caec1ccb4d918bdd0f2fc2c9cadcd06


286598 10-Aug-2015 mav

MFV 286597: 5701 zpool list reports incorrect "alloc" value for cache devices

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@a52fc310ba80fa3b2006110936198de7f828cd94


286593 10-Aug-2015 mav

Local addition and mismerge fix for r286579.


286589 10-Aug-2015 mav

MFV 286588: 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>

illumod/illumos-gate@34e8acef009195effafdcf6417aec385e241796e


286587 10-Aug-2015 mav

MFV 286586: 5746 more checksumming in zfs send

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@98110f08fa182032082d98be2ddb9391fcd62bf1


286579 10-Aug-2015 mav

MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@fe319232d24f4ae183730a5a24a09423d8ab4429


286576 10-Aug-2015 mav

Fix r286570 build with debug.


286575 10-Aug-2015 mav

MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds

Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin Gibbs <justing@spectralogic.com>

illumos/illumos-gate@bc9014e6a81272073b9854d9f65dd59e18d18c35


286574 10-Aug-2015 mav

MFV r277427: 5445 Add more visibility via arcstats; specifically
arc_state_t stats and differentiate between "data" and "metadata"

Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@4076b1bf41cfd9f968a33ed54a7ae76d9e996fe8


286570 10-Aug-2015 mav

MFV r277426: 5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <Chris.Williamson@delphix.com>

illumos/illumos-gate@89c86e32293a30cdd7af530c38b2073fee01411c

Currently, every buffer cached in the L2ARC is accompanied by a 240-byte
header in memory, leading to very high memory consumption when using very
large cache devices. These changes significantly reduce this overhead.

Currently:

L1-only header = 176 bytes
L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr
= 240 bytes

Memory-optimized:

L1-only header = 176 bytes
L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes
L2-only header = 96 bytes + 32 byte checksum = 128 bytes

So overall:

Trunk Optimized
+-----------------+
L1-only | 176 B | 176 B | (same)
+-----------------+
L1 & L2 | 240 B | 208 B | (saved 32 bytes)
+-----------------+
L2-only | 240 B | 128 B | (saved 116 bytes)
+-----------------+

For an average blocksize of 8KB, this means that for the L2ARC, the ratio
of metadata to data has gone down from about 2.92% to 1.56%. For a
'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this
means that we expect a completely full L2ARC to use (1600 GB * 0.0156) /
60GB = 41% of the available memory, down from 78%.


286556 09-Aug-2015 mav

MFV 286555: Avoid 128K kmem allocations in mzap_upgrade()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>

illumos/illumos-gate@be3e2ab906b80af79c7b22885f279e45ad8fb995


286554 09-Aug-2015 mav

MFV 286553: 5769 Cast 'zfs bad bloc' to ULL for x86

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@8c76e0763bcf0029556e106377da859f6492a7ee


286551 09-Aug-2015 mav

MFV 286550: 5694 traverse_prefetcher does not prefetch enough

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@34d7ce052c4565b078f73b95ccbd49274e98edaa


286549 09-Aug-2015 mav

MFV 286548:
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override

Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@7f7ace370074e350853da254c65688fd43ddc695


286547 09-Aug-2015 mav

MFV 286546:
5661 ZFS: "compression = on" should use lz4 if feature is enabled

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@db1741f555ec79def5e9846e6bfd132248514ffe


286545 09-Aug-2015 mav

MFV 286544:
5630 stale bonus buffer in recycled dnode_t leads to data corruption

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>


286543 09-Aug-2015 mav

MFV 286542: 5592 NULL pointer dereference in dsl_prop_notify_all_cb()

Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@9d47dec0481d8cd53b2c1053c96bfa3f78357d6a


286541 09-Aug-2015 mav

MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@e57a022b8f718889ffa92adbde47a8f08abcdb25


286539 09-Aug-2015 mav

MFV 286538:
5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@0fda3cc5c1c5a1d9bdea6d52637bef6e781549c9


286223 03-Aug-2015 smh

Fix KSTACK_PAGES check in ZFS module

The check introduced by r285946 failed to add the dependency on
opt_kstack_pages.h which meant the default value for the platform instead
of the customised options KSTACK_PAGES=X was being tested.

Also wrap in #ifdef __FreeBSD__ for portability.

MFC after: 3 days
Sponsored by: Multiplay


286167 02-Aug-2015 markj

Avoid dereferencing curthread->td_proc->p_cred in DTrace probe context.

When a process is exiting, there is a narrow window where p_cred may be
NULL while its threads are still executing. Specifically, the last thread
to exit a process sets the process state to PRS_ZOMBIE with the proc
spinlock held and then calls thread_exit(). thread_exit() drops the spin
lock, permitting the process to be reaped and thus causing its cred struct
to be released. However, the exiting thread may still cause DTrace probes
to fire by calling sched_throw(), resulting in a double fault if such a
probe enabling attempts to access the GID or UID DIF variables.

The thread's cred reference is not susceptible to this race since it is not
released until after the thread has exited.

MFC after: 1 week


285946 28-Jul-2015 smh

Add warning about low KSTACK_PAGES for ZFS use

As ZFS requires a more kernel stack pages than is the default on some
architectures e.g. i386, warn if KSTACK_PAGES is less than
ZFS_MIN_KSTACK_PAGES (which is 4 at the time of writing).

MFC after: 3 days
Sponsored by: Multiplay


285632 16-Jul-2015 mjg

vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho


285021 02-Jul-2015 avg

zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls

We now take z_teardown_lock as a writer to ensure that there is no I/O
while the filesystem state is in a flux. Also, zfs_suspend_fs() ->
zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() ->
zfsvfs_setup() call zfs_unregister_callbacks(). Previously there was no
synchronization between those calls and the calls in the re-mounting
case. That could lead to concurrent execution and a crash.

PR: 180060
Differential Revision: https://reviews.freebsd.org/D2865
Suggested by: mahrens
Reviewed by: delphij, pho, mahrens, will
MFC after: 13 days
Sponsored by: ClusterHQ


285009 01-Jul-2015 br

First cut of DTrace for AArch64.

Reviewed by: andrew, emaste
Sponsored by: ARM Limited
Differential Revision: https://reviews.freebsd.org/D2738


284593 19-Jun-2015 avg

MFV r284412: 5911 ZFS "hangs" while deleting file

Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@46e1baa6cf6d5432f5fd231bb588df8f9570c858

https://www.illumos.org/issues/5911
Sometimes ZFS appears to hang while deleting a file. It is actually
making slow progress at the file deletion, but other operations
(administrative and writes via the data path) "hang" until the file
removal completes, which can take a long time if the file has many
blocks. The deletion (or most of it) happens in a single txg, and the
sync thread spends most of its time reading indirect blocks via this
stack trace:
swtch+0x141()
cv_wait+0x70()
zio_wait+0x5b()
dbuf_read+0x2c0()
free_children+0x50()
free_children+0x12a()
free_children+0x12a()
free_children+0x12a()
dnode_sync_free_range_impl+0xdf()
dnode_sync_free_range+0x52()
range_tree_vacate+0x65()
dnode_sync+0x1d8()
dmu_objset_sync_dnodes+0x77()
dmu_objset_sync+0x19f()
dsl_dataset_sync+0x51()
dsl_pool_sync+0x9a()
spa_sync+0x2ff()
txg_sync_thread+0x21f()
thread_start+8()
One way to reproduce the problem is if we are over the arc_meta_limit,
e.g. because lots of indirect blocks are pinned because we have L0
dbufs under them. It could be that most of the L1 indirects are cached,
in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(),
it will complete very quickly. This allows dmu_free_long_range_impl() to
put many (perhaps all of its) transactions in the same TXG. However,
dmu_free_long_range_impl() calls dnode_evict_dbufs (and
dnode_free_range()), which removes the L0 dbufs, thus reducing the hold
count on the L1 indirect blocks above it, allowing them to be evicted.
Because we are over the arc_meta_limit(), these L1 blocks will be
evicted ASAP. Thus when we get to syncing context, the L1 indirects are
no longer cached and must be read in.

Obtained from: illumos
MFC after: 15 days


284591 19-Jun-2015 avg

illums compat: use flsl/flsll for highbit/highbit64

Do that only when when fast inline versions are available.
At the moment that can be the case only in the kernel and not for all
platforms.

The original code uses the binary search and that's kept as a fallback.
This is a micro optimization.

Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mahrens, mav
MFC after: 17 days


284529 17-Jun-2015 glebius

o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by: Nginx, Inc.
Sponsored by: Netflix


284520 17-Jun-2015 avg

Revert r284511 because it caused build failures on many platforms

The problem is that when inline versions of flsl and flsll are not
available, then libkern.h must be included for their declarations
in kernel sources.
The fix would be trivial, but I would like to figure out first if
it even makes sense to use the libkern provided implementations.

Reported by: bz
Pointyhat to: avg


284513 17-Jun-2015 avg

l2arc: pass correct size to trim requests

b_size is a logical size of a buffer in memory, b_asize is its physical
size that accounts for possible compression.
Currently the latter is the best approximation for the allocated, on-disk
size.

L2ARC TRIM support was committed a few weeks before L2ARC compression
was imported, so originally the code was correct, because b_size was
the size.

Further thoughts. Given that the cache device is being overwritten
in a circular fashion it is not clear if a TRIM per each evicted L2ARC
buffer has any benefits.
Maybe it would be sufficient to issue a single trim request for the whole
device when it is loaded, e.g. after a bootup, or when it is unloaded, e.g.
before a shutdown. At least as long as L2ARC is not persistent across
reboots.

Discussed with: smh
MFC after: 19 says


284511 17-Jun-2015 avg

illumos compat: use flsl/flsll for highbit/highbit64

This is a micro optimization.
The upstream code uses the binary search.

Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mav
MFC after: 15 days


284306 12-Jun-2015 avg

MFV r284036: 5961 Fix stack overflow in zfs_create_fs

illumos/illumos-gate@c701fde6911c957e71b37aac4daf672bd828f4d7

Author: glebius
MFC after: 11 days


284304 12-Jun-2015 avg

MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size

illumos/illumos-gate@81cd5c555f505484180a62ca5a2fbb00d70c57d6

Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 17 days


284303 12-Jun-2015 avg

MFV r283534: 5515 dataset user hold doesn't reject empty tags

illumos/illumos-gate@752fd8dabccac68d6d09f82f3bf3561e055e400b

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
MFC after: 10 days


284301 12-Jun-2015 avg

MFV r284040: check that datasets are snapshots

5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

illumos/illumos-gate@24218bebb460e4015fac2c9f2cec1902eddbcd7b

Note that the upstream commit is modified during MFV: in the upstream
the check is done by inspecting ds_is_snapshot field while in FreeBSD
we call dsl_dataset_is_snapshot().
This is because illumos/illumos-gate@bc9014e6a81272073b9854d9f65dd59e18d18c35
(r277428 in vendor-sys/illumos) is not MFV-ed yet.

MFC after: 10 days


283629 27-May-2015 kib

Add missed {}.

Noted by: Morten Rodal <morten@rodal.no>
MFC after: 2 weeks


283602 27-May-2015 kib

Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode. Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced. dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


283525 25-May-2015 avg

zfs: fixes for a full stream received into an existing dataset

- this should fail early unless the force flag is set
- if the force flag is set then any local modifications including
snapshots should be undone

See: https://www.illumos.org/issues/5912
See: https://reviews.csiden.org/r/220/

Reviewed by: mahrens, Paul Dagnelie <pcd@delphix.com>
MFC after: 15 days
Sponsored by: ClusterHQ


283524 25-May-2015 avg

dsl_dataset_promote_check: ensure that shared snaps do not become too long

... after they are transfered from the old origin to the new one.

See: https://www.illumos.org/issues/5909
See: https://reviews.csiden.org/r/219/

Reviewed by: mahrens
MFC after: 10 days
Sponsored by: ClusterHQ


283515 25-May-2015 kib

Remove excess Giant acquisition around the dounmount() call.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


283291 22-May-2015 jkim

CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks


282880 14-May-2015 smh

Add copyright info missing from r282205

Add the copyright info missing from ZoL origin version.

MFC after: 2 days
Sponsored by: Multiplay


282766 11-May-2015 avg

zfs ioctls: use fget_write / fget_read instead of getf wrapper for fget

This allows to ensure that we do not write to a file that was opened
for reading only or vice versa.

Also, use the correct capability in in zfs_ioc_send_new().

Differential Revision: https://reviews.freebsd.org/D2382
Reviewed by: delphij
MFC after: 17 days
Sponsored by: ClusterHQ


282703 10-May-2015 jhibbits

Fix a couple bugs in 64-bit powerpc fasttrap argument retrieval.

Found by code inspection.


282632 08-May-2015 avg

MFV r282630: 5809 Blowaway full receive in v1 pool causes kernel panic

MFC after: 5 days


282475 05-May-2015 avg

zfs: do not hold an extra reference on a root vnode while a filesystem is mounted

At present zfs_domount() acquires a reference on the filesystem's root vnode
and that reference is kept until zfs_umount.
The latter calls vflush(rootrefs = 1) to dispose of the extra reference.

There is no explanation of why that reference is kept - what problem it
solves or what behavior it improves.
Also, that logic is FreeBSD specific.

There is one real problem with that reference, though.
zfs recv -F may receive a full, non-incremental stream to a mounted filesystem.
In that case the received root object is likely to have a different z_gen
attribute value. Because of that, zfs_rezget will leave the previous root znode
and vnode disassociated from the actual object (z_sa_hdl == NULL).
Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode
pair, while the old one will be kept alive by the outstanding reference.
So, the outstanding reference will not actually be for the new root vnode
(or, more precisely, vnodes - because a root vnode may be recycled and a newer
one can be created).
As a result, when vflush(rootrefs = 1) s called there will be two problems:

- a leaked reference on the old root vnode preventing a graceful unmount
- insufficient references on the actual root vnode leading to a crash upon
access to the vnode after it is destroyed by vgone() + vdrop()

The second issue will actually override the first one.

Differential Revision: https://reviews.freebsd.org/D2353
Reviewed by: delphij, kib, smh
MFC after: 17 days


282473 05-May-2015 avg

dmu_recv_end_check: don't leak hold if dsl_destroy_snapshot_check_impl fails

The leak may happen if !drc_newfs && drc_force and there is an error
iterating through snapshots or any of snapshot checks fails.

See https://www.illumos.org/issues/5870
See https://reviews.csiden.org/r/206/

Reviewed by: mahrens (as mahrens@delphix.com)
MFC after: 15 days
Sponsored by: ClusterHQ


282205 28-Apr-2015 smh

Fix misuse of input argument in traverse_visitbp

In traverse_visitbp(), the input argument dnp is modified in the middle
to point to a temporary buffer. Originally this doesn't matter, because
no user of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of
code is added dereferencing dnp after the modification, creating a possible
bug.

We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.

Obtained from: zfsonlinux (a585f2f844ed3d4270221fed88f5e494eb55d932)
MFC after: 2 weeks
Sponsored by: Multiplay


282131 28-Apr-2015 avg

replace a comment about zfs recv -F corner case with a longer, more detailed one

The old comment in zfs_rezget explains what situation the code handles,
the new comment also describes how the situation can arise.

Also, re-join a line that became sufficiently shorti some time ago.

Differential Revision: https://reviews.freebsd.org/D2352
Reviewed by: delphij, smh
MFC after: 12 days


282130 28-Apr-2015 avg

zfs_onexit_fd_hold: return EBADF even if devfs_get_cdevpriv gave ENOENT

/dev/zfs always has per-open data, so when it is missing the file
descriptor is for some other file. Returning ENOENT in this case
is confusing as a variety of other conditions (like a missing dataset)
may result in the same error. It's better to consistently return
EBADF for any problems with the file descriptor.

Note that zfs_onexit_fd_hold() is used with 'automatic cleanup fd'
- when that fd is closed, typically because a process is terminated,
some cleanup action is taken by ZFS driver. E.g. a temporary
snapshot hold is released.

Perhaps, it would even be worthwhile changing devfs_get_cdevpriv()
to return EBADF if there is no associated data.

Differential Revision: https://reviews.freebsd.org/D2370
Reviewed by: delphij, smh
MFC after: 12 days


282127 28-Apr-2015 avg

dsl_dir_rename_check: return EXDEV on cross-pool rename attempt

Obtained from: zfsonlinux/zfs@9063f65476b7b7d78ccf096fec890b8727117e2a
Obtained from: Boris Protopopov <boris.protopopov@actifio.com>
MFC after: 10 days


282126 28-Apr-2015 avg

MFV r282123: 5610 zfs clone from different source and target pools produces coredump

MFC after: 10 days


282125 28-Apr-2015 avg

MFV r282124: 5393 spurious failures from dsl_dataset_hold_obj()

The actual bugfix was pro-actively committed in r275515.
This MFV is cosmetic, it just aligns code style with the upstream.

MFC after: 10 days


281916 24-Apr-2015 markj

Fix DTrace's panic() action.

It would previously call into some unfinished Solaris compatibility code and
return without actually calling panic(9). The compatibility code is
unneeded, however, so just remove it and have dtrace_panic() call vpanic(9)
directly.

Differential Revision: https://reviews.freebsd.org/D2349
Reviewed by: avg
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division


281667 17-Apr-2015 delphij

Remove vfs.zfs.snapshot_list_prefetch, the corresponding code was
gone in r248571 already.

MFC after: 1 week


281257 08-Apr-2015 markj

libdtrace: add support for lazyload mode.

Passing "-x lazyload" to dtrace -G during compilation causes dtrace(1) to
not link drti.o into the output object file, so the USDT probes are not created
during process startup. Instead, dtrace(1) will automatically discover and
create probes on the process' behalf when attaching.

Differential Revision: https://reviews.freebsd.org/D2203
Reviewed by: rpaulo
MFC after: 1 month


281109 05-Apr-2015 mav

Add DTrace probe to the new ARC reclaim cause added in r281026.

MFC after: 1 month


281026 03-Apr-2015 mav

Make ZFS ARC track both KVA usage and fragmentation.

Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage
reaches certain threshold (3/4 on i386 or 16/17 otherwise). FreeBSD has
even less KVA, but had no such limit on archs with direct map as amd64.
As result, on machines with a lot of RAM, during load with very small user-
space memory pressure, such as `zfs send`, it was possible to reach state,
when there is enough both physical RAM and KVA (I've seen up to 25-30%),
but no continuous KVA range to allocate even single 128KB I/O request.

Address this situation from two sides:
- restore KVA usage limitations in a way the most close to Illumos;
- introduce new requirement for KVA fragmentation, specifying that we
should have at least one sequential KVA range of zfs_max_recordsize bytes.

Experiments show that first limitation done alone is not sufficient. On
machine with 64GB of RAM it is sometimes needed to drop up to half of ARC
size to get at leats one 1MB KVA chunk. Statically limiting ARC to half
of KVA/RAM is too strict, so second limitation makes it to work in cycles:
accumulate trash up to certain critical mass, do massive spring-cleaning,
and then start littering again. :)

MFC after: 1 month


280951 01-Apr-2015 andrew

Add the arm64 defines for cddl code.

Differential Revision: https://reviews.freebsd.org/D2186
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation


280822 29-Mar-2015 mav

Some cosmetic polishing. No functional change.

MFC after: 1 week


280132 16-Mar-2015 markj

Remove unused upstream DTrace provider implementations that are duplicates
of providers under sys/cddl/dev/. Also remove sdt_subr.c, which isn't used
in FreeBSD's SDT implementation.

Suggested by: rwatson


279996 14-Mar-2015 smh

Allow zvol_geom_worker to process BIO_DELETE's

If zvol_geom_start is called with a BIO_DELETE from a thread which can
sleep it queues it for later processing by the zvol_geom_worker. The
zvol_geom_worker didn't have a delete case so would simply loose the bio
hence preventing the original caller from every completing. In addition
an other unknown types would suffer the same fate.

Allow zvol_geom_worker to process BIO_DELETE's via zvol_strategy and
return unsupported for all unknown bio types.

MFC after: 2 weeks
Sponsored by: Multiplay


279927 12-Mar-2015 mav

Make DIOCGATTR in device mode handle "GEOM::candelete".

MFC after: 3 days


279667 05-Mar-2015 andrew

Add the MD parts of dtrace needed to use fbt on ARM. For this we need to
emulate the instructions used in function entry and exit.

For function entry ARM will use a push instruction to push up to 16
registers to the stack. While we don't expect all 16 to be used we need to
handle any combination the compiler may generate, even if it doesn't make
sense (e.g. pushing the program counter).

On function return we will either have a pop or branch instruction. The
former is similar to the push instruction, but with care to make sure we
update the stack pointer and program counter correctly in the cases they
are either in the list of registers or not. For branch we need to take the
24-bit offset, sign-extend it, and add that number of 4-byte words to the
program counter. Care needs to be taken as, due to historical reasons, the
address the branch is relative to is not the current instruction, but 8
bytes later.

This allows us to use the following probes on ARM boards:
dtrace -n 'fbt::malloc:entry { stack() }'
and
dtrace -n 'fbt::free:return { stack() }'

Differential Revision: https://reviews.freebsd.org/D2007
Reviewed by: gnn, rpaulo
Sponsored by: ABT Systems Ltd


278529 10-Feb-2015 gnn

Initial version of DTrace on ARM32.

Submitted by: Howard Su based on work by Oleksandr Tymoshenko
Reviewed by: ian, andrew, rpaulo, markj


278370 08-Feb-2015 markj

Fix a typo in r278137: make sure to free provider state.

X-MFC-With: r278136


278167 03-Feb-2015 pfg

MFV r266995:

4767 dtrace_probe() always has the timestamp

Reference:
https://illumos.org/issues/4767

Obtained from: Illumos
MFC after: 2 weeks


278166 03-Feb-2015 pfg

MFV r266993:

4469 DTrace helper tracing should be dynamic

Reference:
https://illumos.org/issues/4469

Obtained from: Illumos
Phabric: D1551
Reviewed by: markj
MFC after: 2 weeks


278137 03-Feb-2015 markj

Continue to handle the case where state is NULL, though this currently
cannot happen on FreeBSD. r278136 overlooked the fact that a destructor
registered with devfs_set_cdevpriv(9) is invoked even in the case of an
error.

X-MFC-With: r278136


278136 03-Feb-2015 markj

Diff reduction with illumos, in preparation for merging r266993 from the
vendor branch. No functional change.

MFC after: 1 week


278040 02-Feb-2015 smh

Prevent inlining txg_quiesce

This allows dtrace to monitor the calls to txg_quiesce which can be really
helpful.

Also standardise __noinline order for arc_kmem_reap_now.

Sponsored by: Multiplay


277915 30-Jan-2015 markj

Don't attempt to disable enabled fasttrap probes in an exiting process.
There's no need to do so, and we can't hold an exiting process, so this
race can result in panics.

MFC after: 1 week


277914 30-Jan-2015 markj

In fasttrap_sigtrap(), use tdsendsignal() rather than tdksignal() to send
SIGTRAP. The latter requires that its thread argument be non-NULL, but
fasttrap_sigtrap() does not.

PR: 193593
MFC after: 1 week
Reported by: danilo


277826 28-Jan-2015 delphij

MFV r255258:

Diff reduction with upstream. The actual change was merged in r272483
already.

MFC after: 2 weeks


277629 24-Jan-2015 will

When creating or updating a node, use vfs_timestamp() for "now" instead
of gethrestime(), to allow the administrator to decide the appropriate
timestamp precision instead of always using nanosecond precision.


277504 21-Jan-2015 will

Remove commented log messages.


277503 21-Jan-2015 will

Ignore sync requests from the system syncher, i.e. VFS_SYNC(waitfor=MNT_LAZY).

ZFS already commits outstanding data every zfs_txg_timeout seconds, so these
syncs are unnecessarily intrusive.

Submitted by: gibbs
Sponsored by: Spectra Logic
MFSpectraBSD: 1105759 on 2014/12/11


277501 21-Jan-2015 will

Eliminate an #ifdef illumos for zfs_ioc_rename().

Since allow_mounted is a FreeBSD-specific change, default to B_TRUE, then
locally check for the magic bit. Unconditionally check allow_mounted below.
Convert the setting of allow_mounted to an explicit boolean.

MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 672578 (in part) on 2013/07/19


277492 21-Jan-2015 will

Add vfs.zfs.reference_tracking_enable sysctl/tunable.

This is primarily for developer/debugging use; it enables built-in tagged
tracking of refcounts inside ZFS. It can only be enabled from the loader,
since it modifies how in-core state is managed. Default remains disabled.

MFC after: 1 week
Sponsored by: Spectra Logic


277452 20-Jan-2015 will

Fix arc__shrink DTrace probe's to_free argument.

Remove the unnecessary #ifdef _KERNEL, which did not differ in the true or
false cases. Actually set the value of to_free before using it.

MFC after: 1 week
Sponsored by: Spectra Logic


277450 20-Jan-2015 will

Use the "zfs_gfs" tag for GFS vnodes to make them easier to identify.

MFC after: 1 week
Sponsored by: Spectra Logic


277419 20-Jan-2015 mav

Allow skipping dmu_buf_will_dirty() call in dsl_dir_transfer_space().

dsl_dir_transfer_space() is mostly called after dsl_dir_diduse_space(),
which already calls dmu_buf_will_dirty() for the same dbuf and tx, so
its duplicate call in those cases will change nothing, only spend time.

Skipping this call by four times reduces time spent in dbuf_write_done()
and descendants, updating dataset statistics with several congested lock
acquisitions. When rewriting 8K zvol blocks at 1GB/s rate, this reduces
CPU time spent inside dbuf_write_done(), according to profiling, from 45%
of 683K samples to 18% of 422K.

MFC after: 2 weeks


277351 18-Jan-2015 smh

Clean ZFS spa config before syncing

A number of entries that can be present in the spa config shouldn't be saved
to disk so add a method to ensure this is case. Without this if the last
caller to vdev_config_generate requested stats then we can end up in the
cache file.

Also only skip a none writable pool in the cache file generation if its
active. This prevents unavailable pools incorrectly getting removed from
cache file.

Tested by: delphij
MFC after: 2 weeks
Sponsored by: Multiplay


277300 17-Jan-2015 smh

Mechanically convert cddl sun #ifdef's to illumos

Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.

Also do a manual pass to correct the use if #ifdef comments as per style(9)
as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.

MFC after: 1 month
Sponsored by: Multiplay


277185 14-Jan-2015 mav

Fix overflow bug from r248577, turning 30s TRIM timeout into ~4s.

MFC after: 2 weeks


277169 14-Jan-2015 mav

Reimplement TRIM throttling added in r248577.

Previous throttling implementation approached problem from the wrong side.
It significantly limited useful delaying of TRIM requests and aggregation
potential, while not so much controlled TRIM burstiness under heavy load.

With this change random 4K write benchmarks (probably the worst case for
TRIM) show me IOPS increase by 20%, average latency reduction by 30%, peak
TRIM bursts reduction by 3 times and same peak TRIM map size (memory usage).

Also the new logic does not force map size down so heavily, really allowing
to keep deleted data for 32 TXG or 30 seconds under moderate load. It was
practically impossible with old throttling logic, which pushed map down to
only 64 segments.

Reviewed by: smh
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


277096 12-Jan-2015 mav

Skip extra bcopy() when scrubbing vdev without redundancy.

According to profiler, this bcopy() can use about 10% of CPU time.

MFC after: 2 weeks


276983 11-Jan-2015 mav

When aggregating TRIM segments, move the new one to the list end.

New segment at the list head may block all TRIM requests until txg of that
segment can be processed. On my random I/O tests this change reduce peak
TRIM list length from 650 to 450 segments. Hopefully it should reduce TRIM
burstiness when list processing is unblocked.

MFC after: 2 weeks


276952 11-Jan-2015 mav

Add LBA as secondary sort key for synchronous I/O requests.

On FreeBSD gethrtime() implemented via getnanouptime(), that has 1ms (1/hz)
precision. It makes primary sort key (timestamp) collision very possible.
In such situations sorting by secondary key of LBA is much more reasonable
then by totally meaningless zio pointer value.

With this change on multi-threaded synchronous ZVOL read I've measured 10%
throughput increase and average latency reduction.

MFC after: 2 weeks


276913 10-Jan-2015 mav

Use new optimized dmu_read_uio_dbuf() for ZVOLs in device mode.

This slightly reduces overhead by avoiding dnode_hold()/dnode_rele() calls.

MFC after: 2 weeks


276450 31-Dec-2014 smh

Correct zpool list displaying invalid EXPANDSZ for unavailable pool vdevs

When pools are unavailable their vdevs are also unavailable which means
that vdev_max_asize remains at the default zero. This default was being
used to calculate vs_esize resulting in a negative number as vdev_asize >
vdev_max_asize, which caused zpool list -v to display 16.0E for EXPANDSZ
of these vdevs.


276123 23-Dec-2014 smh

Always sync the global ZFS config cache to reflect the new mosconfig

This fixes out of date zpool.cache for root pools, which can cause issues
such as confusion of zdb etc.

MFC after: 1 month


276069 22-Dec-2014 smh

Fix panic when resizing ZFS zvol's

Resizing a ZFS ZVOL with debug enabled would result in a panic due to
recursion on dp_config_rwlock.

The upstream change "3464 zfs synctask code needs restructuring" changed
zvol_set_volsize to avoid the recursion on dp_config_rwlock, but this was
missed when originally merged in by r248571 due to significant differences
in our codebases in this area.

These changes also relied on bring in changes from upstream:
3557 dumpvp_size is not updated correctly when a dump zvol's size is
changed, which where also not present.

In order to help prevent future issues in this area a direct comparison
and diff minimisation from current upstream version (b515258) of zvol.c.

Differential Revision: https://reviews.freebsd.org/D1302
MFC after: 1 month
X-MFC-With: r276063 & r276066
Sponsored by: Multiplay


276066 22-Dec-2014 smh

Refactor zvol locking to minimise diff with upstream

Use #define zfsdev_state_lock spa_namespace_lock instead of replacing all
zfsdev_state_lock with spa_namespace_lock to minimise changes from upstream.

Differential Revision: D1302
MFC after: 1 month
X-MFC-With r276063
Sponsored by: Multiplay


276063 22-Dec-2014 smh

Standardise on illumos for #ifdef's in zvol.c

Also correct as per style(9) on the use of #ifdef comments.

This is a no-op change as pre-cursor to a full cleanup and merge with
upstream zvol changes.

Sponsored by: Multiplay


276007 21-Dec-2014 kib

Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some
fs, e.g. smbfs, already did it.

Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


275923 19-Dec-2014 delphij

Add missing continue: we can't proceed further if the
kernel does not panic with zfs_panic_recover.

Illumos issue:
5438 zfs_blkptr_verify should continue after zfs_panic_recover

Reported by: Coverity
CID: 1232014


275922 18-Dec-2014 delphij

MFV r275914:

As of r270383, the dbuf_compare comparator compares the dbuf
attributes in the following order:

db_level (indirect level)
db_blkid (block number)
db_state (current state)
the address of the element

Because db_state is being considered before the element's state,
changing of db_state would affect balancedness of the AVL tree,
even when the address of element compares differently. For
instance, in dbuf_create, db_state may be altered after the
node is inserted into the AVL tree and may break AVL tree
balancedness.

Instead of using db_state as a comparision critera (introduced
in r270383), consider it only when we are doing a lookup, that
is one of the two dbuf pointers contains DB_SEARCH.

Illumos issue:
5422 preserve AVL invariants in dn_dbufs

MFC after: 2 weeks


275897 18-Dec-2014 kib

The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


275811 15-Dec-2014 delphij

MFV r275783:

Convert ARC flags to use enum. Previously, public flags are defined in
arc.h and private flags are defined in arc.c which can lead to confusion
and programming errors.

Consistently use 'hdr' (when referencing arc_buf_hdr_t) instead of 'buf'
or 'ab' because arc_buf_t are often named 'buf' as well.

Illumos issue:
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme

MFC after: 2 weeks


275782 15-Dec-2014 delphij

MFV r275551:

Remove "dbuf phys" db->db_data pointer aliases.

Use function accessors that cast db->db_data to the appropriate
"phys" type, removing the need for clients of the dmu buf user
API to keep properly typed pointer aliases to db->db_data in order
to conveniently access their data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
In zap_leaf() and zap_leaf_byteswap, now that the pointer alias
field l_phys has been removed, use the db_data field in an on
stack dmu_buf_t to point to the leaf's phys data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
Remove the db_user_data_ptr_ptr field from dbuf and all logic
to maintain it.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
Modify the DMU buf user API to remove the ability to specify
a db_data aliasing pointer (db_user_data_ptr_ptr).

cddl/contrib/opensolaris/cmd/zdb/zdb.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_diff.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
Create and use the new "phys data" accessor functions
dsl_dir_phys(), dsl_dataset_phys(), zap_m_phys(),
zap_f_phys(), and zap_leaf_phys().

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
Remove now unused "phys pointer" aliases to db->db_data
from clients of the DMU buf user API.

Illumos issue:
5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS

MFC after: 2 weeks


275781 15-Dec-2014 delphij

MFV r275550:

In addition to r273158, make the code in spa_sync() that checks if the
current TXG is a no-op TXG less fragile.

Illumos issue:
5347 idle pool may run itself out of space

MFC after: 2 weeks


275780 15-Dec-2014 delphij

MFV r275549:

Add a loader tunable, vfs.zfs.arc_meta_min, which controls how much metadata
ZFS should keep in ARC at minimum.

In arc_evict(), when doing recycle, take more factors into account by
applying the following policy:

1. If no evictable data, evict metadata;
2. If no evictable metadata, evict data;
3. If we hit arc_meta_limit, evict metadata;
4. If we haven't hit arc_meta_min, evict data;
5* (Illumos only, not present in new FreeBSD code, yet) evict the oldest
cached element from data and metadata.
(FreeBSD) evict the data type specified by caller, which is the
existing behavior.

Note that because of our splitted locks (implemented in r205231 to improve
scalability by reducing lock contention), implementing the fifth Illumos
behavior will not be cheap, so for now just implement the 1-4 and fall back
to current behavior for 5.

Illumos issue:
5368 ARC should cache more metadata

MFC after: 2 months (assuming we didn't found better solution)


275748 13-Dec-2014 delphij

MFV r247174:

Expose arc_meta_limit, et al via kstats.

Note that as a result, vfs.zfs.arc_meta_used is removed.
The existing vfs.zfs.arc_meta_limit sysctl/tunable is retained
with a SYSCTL_PROC wrapper.

Illumos ZFS issues:
3561 arc_meta_limit should be exposed via kstats

Relnotes: yes
MFC after: 2 weeks


275740 13-Dec-2014 delphij

MFV r275548:

Verify that the block pointer is structurally valid, before attempting to
read it in. It can only be invalid in the case of a ZFS bug, but this
change will help identify such bugs in a more transparent way, by
panic'ing with a relevant message, rather than indexing off the end of an
array or something.

Illumos issue:
5349 verify that block pointer is plausible before reading

MFC after: 2 weeks


275738 13-Dec-2014 delphij

MFV r275546:

Reduce scrub activities when system there is enough dirty data, namely when
dirty data is more than zfs_vdev_async_write_active_min_dirty_percent (once
we start to increase the number of concurrent async writes).

While there also correct rounding error which would make scrub end up
pausing for (zfs_txg_timeout + 1) seconds instead of the desired
zfs_txg_timeout seconds.

Illumos issue:
5351 scrub goes for an extra second each txg
5352 scrub should pause when there is some dirty data

MFC after: 2 weeks


275737 13-Dec-2014 delphij

MFV r275545:

If zio_checksum_error() returns other than ECKSUM (e.g. EINVAL), it does not
fill in the "zio_bad_cksum_t *info" parameter. Caller should not attempt to
use it in this case.

Illumos issue:
5348 zio_checksum_error() only fills in info if ECKSUM

MFC after: 2 weeks


275736 13-Dec-2014 delphij

MFV r275544:

Clean up some duplicated code in dnode_sync() around freeing spill blocks.

Illumos issue:
5350 clean up code in dnode_sync()

MFC after: 2 weeks


275735 13-Dec-2014 delphij

MFV r275543:

Remove always true tests for ds->ds_phys' presence.

Clean up assertions in dsl_dataset_disown.

Remove unreachable code in dsl_dataset_disown().

Illumos issue:
5310 Remove always true tests for non-NULL ds->ds_phys

MFC after: 2 weeks


275734 13-Dec-2014 delphij

MFV r275542:

If a dnode has a spill block and there is an error while accessing
a data block then traverse_dnode() loses information about that error
and returns a status of visiting the spill block.

This issue is discovered by Spectra Logic.

Illumos issue:
5311 traverse_dnode may report success when it should not

Original author: gibbs
MFC after: 2 weeks


275594 08-Dec-2014 delphij

MFV r275540:

When importing a pool, don't assume that the passed pool configuration
at vdev_load is always vaild. It's possible that a stale configuration
that comes with extra vdevs, where metaslab_init() would fail because
of lower layer returns error.

Change the code to make metaslab_init() handle and return errors from
lower layer and pass it back to upper layer and handle it there.

Illumos issue:
5213 panic in metaslab_init due to space_map_open returning ENXIO

MFC after: 2 weeks


275576 07-Dec-2014 avg

remove opensolaris cyclic code, replace with high-precision callouts

In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior. Now we can directly use callout(9) in the very few
places where cyclic was used.

Differential Revision: https://reviews.freebsd.org/D1161
Reviewed by: gnn, jhb, markj
MFC after: 2 weeks


275565 06-Dec-2014 andrew

Apply the same fix in r274697 to the ARM case.


275562 06-Dec-2014 delphij

MFV r275535:

Unexpand ISP2() and MSEC2NSEC().

Illumos issue:
5255 uts shouldn't open-code ISP2

MFC after: 2 weeks


275561 06-Dec-2014 delphij

MFV r275534:

Sync with Illumos. This have no effect to FreeBSD.

Illumos issue:
5285 pass in cpu_pause_func via pause_cpus

MFC after: 2 weeks


275560 06-Dec-2014 delphij

MFC r275533:

Sync with Illumos. This have no effect to FreeBSD.

Illumos issue:
5100 sparc build failed after 5004

MFC after: 2 weeks


275530 05-Dec-2014 delphij

Use %d instead of %u for error number. This way we see ERESTART as -1
not 4294967295 when doing DTrace.

MFC after: 2 weeks


275515 05-Dec-2014 delphij

Fix a regression introduced in r274337 (large block support)

In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS)
to determine whether the extensible (zapifyed) dataset have large blocks.
The code expects the result be either 0 (found) or ENOENT (not found),
however reused the variable 'err' which later code expects to be 0.

Fix this by adopting similar code construct that is used later for
DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch
errors from zap_* rountines.

Reported by: Peter J. Creath (on FreeNAS; FreeNAS bug #6848)
Illumos issue: 5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by: mahrens
Sponsored by: iXsystems, Inc.
X-MFC with: r274337


275474 04-Dec-2014 mav

Add GET LBA STATUS command support to CTL.

It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


275401 02-Dec-2014 avg

zfs_putpages: actually update mtime and ctime

Reported by: Paul Koch <paul.koch@akips.com>
Tested by: Paul Koch <paul.koch@akips.com>
MFC after: 2 weeks


275096 26-Nov-2014 delphij

Revert r273060 per discussion with avg@ as we need to make L2ARC
aware of 4K devices and this one is not the right fix anyway.


274697 19-Nov-2014 dim

Fix the following -Werror warning from clang 3.5.0, while building cddl/lib/libctf:

In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31:
In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34:
sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined]
#define _ILP32
^
<built-in>:26:9: note: previous definition is here
#define _ILP32 1
^
1 warning generated.

This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for
the i386 arch. (Earlier versions already predefined _LP64 and __LP64__
for the x86_64 arch.)

Reviewed by: emaste, avg, smh, delphij, markj
Differential Revision: https://reviews.freebsd.org/D1187


274681 18-Nov-2014 delphij

Make vfs.zfs.max_recordsize read-write at runtime.

MFC after: 2 weeks


274674 18-Nov-2014 delphij

Add a tunable for spa_slop_shift which controls how much space we
would reserve by default. Tuning is not recommended.

MFC after: 2 weeks


274673 18-Nov-2014 delphij

Allow tuning zfs_max_recordsize via loader tunable. Tuning is NOT
recommended.

Requested by: Slawa Olhovchenkov <slw zxy spb ru>
MFC after: 2 weeks


274628 17-Nov-2014 avg

l2arc: restore correct rounding up of asize of compressed data

This rounding up was lost in a mismerge of illumos code.
See r268075 MFV r267565.
After that commit zio_compress_data() no longer performs any compressed
size adjustment, so it needs to be done externally. On FreeBSD we round
up the size using vdev_ashift rather than SPA_MINBLOCKSIZE so that 4KB
devices are properly supported.

Additionally, zero out the buffer tail only if compression succeeds.
The compression is considered successful if the size of compressed
data after rounding up to account for the vdev ashift is less than the
original data size. It does not make sense to have the data compressed
if all the savings are lost to rounding up.
With the new zio_compress_data() it could have been possible that the
rounded compressed size would be greater than the original size and thus
we could zero beyond the allocated buffer if the zeroing code was kept
at the original place.

Discussed with: delphij, gibbs
MFC after: 2 weeks
X-MFC with: r274627


274627 17-Nov-2014 avg

Revert r269093 which introduced physical zio alignment transform

Size of physical ZIOs must never be implicitly adjusted, it's
a responsibility of a caller to make sure that such a ZIO has proper offset
and size.

Discussed with: delphij, gibbs
MFC after: 2 weeks


274619 17-Nov-2014 smh

Disable TRIM on file backed ZFS vdevs and fix TRIM on init

After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL
this meant file backed vdevs to attempted to process the ZIO as a write
causing a panic.

We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported
by each vdev type to ensure we explicity support the ZIO type being
processed.

Also ensure that TRIM on init is not procesed for devices which declare they
didn't support TRIM via vdev_notrim.

PR: 195061, 194976, 191573
Sponsored by: Multiplay


274337 10-Nov-2014 delphij

MFV r274273:

ZFS large block support.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.

Limited safety belt is provided for mounted root filesystem but use
caution is advised.

Illumos issue:
5027 zfs large block support

MFC after: 1 month


274304 09-Nov-2014 delphij

MFV r274272 and diff reduction with upstream.

Illumos issue:
5244 zio pipeline callers should explicitly invoke next stage

Tested with: ztest plus ZFS over GELI configuration
MFC after: 1 month


274276 08-Nov-2014 delphij

MFV r274271:

Improve zdb -b performance:

- Reduce gethrtime() call to 1/100th of blkptr's;
- Skip manipulating the size-ordered tree;
- Issue more (10, previously 3) async reads;
- Use lighter weight testing in traverse_visitbp();

Illumos issue:
5243 zdb -b could be much faster

MFC after: 2 weeks


274172 06-Nov-2014 avg

fix l2arc compression buffers leak

We have observed that arc_release() can be called concurrently with a
l2arc in-flight write.
Also, we have observed that arc_hdr_destroy() can be called from
arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar
circumstances.

Previously the l2arc headers would be freed while leaking their
associated compression buffers. Now the buffers are placed on
l2arc_free_on_write list for delayed freeing. This is similar to what
was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.

In addition to fixing the discovered leaks this change also adds some
protective code to assert that a compression buffer associated with a
l2arc header is never leaked.

A new kstat l2_cdata_free_on_write is added. It keeps a count of
delayed compression buffer frees which previously would have been leaks.

Tested by: Vitalij Satanivskij <satan@ukr.net> et al
Requested by: many
MFC after: 2 weeks
Sponsored by: HybridCluster / ClusterHQ


274154 06-Nov-2014 mav

Add to CTL support for logical block provisioning threshold notifications.

For ZVOL-backed LUNs this allows to inform initiators if storage's used or
available spaces get above/below the configured thresholds.

MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


273641 25-Oct-2014 jpaetzel

This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109

Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD

"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.

b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.

c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.

One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.

This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.

PR: 184677
Submitted by: kmacy
Reviewed by: delphij, will, avg
MFC after: 2 weeks
Sponsored by: iXsystems


273377 21-Oct-2014 hselasky

Fix multiple incorrect SYSCTL arguments in the kernel:

- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after: 3 days
Sponsored by: Mellanox Technologies


273267 18-Oct-2014 delphij

Add tunable vfs.zfs.space_map_blksz for space map's maximum block size.

MFC after: 2 weeks


273174 16-Oct-2014 davide

Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by: kmacy
Tested by: make universe


273158 16-Oct-2014 smh

Prevent ZFS leaking pool free space

When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query the
pool status during the hung import would not return as the import holds
the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is where
this lost space will be counted.

MFC after: 3 days
Sponsored by: Multiplay


273060 13-Oct-2014 delphij

Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from: FreeNAS (issue #6239)
MFC after: 2 weeks


273026 13-Oct-2014 delphij

Add a tunable for arc_shrink_shift (vfs.zfs.arc_shrink_shift) that
controls how much fraction, 1/2^arc_shrink_shift, should be reclaimed
when there is memory pressure.

Submitted by: Richard Kojedzinszky <krichy at tvnetwork.hu>
MFC after: 2 weeks


272810 09-Oct-2014 delphij

MFV r272804:

Refactor the code and stop restore_object from creating two transactions.

Illumos issue:
3693 restore_object uses at least two transactions to restore an object

MFC after: 2 weeks


272809 09-Oct-2014 delphij

MFV r272803:

Illumos issue:
5175 implement dmu_read_uio_dbuf() to improve cached read performance

MFC after: 2 weeks


272708 07-Oct-2014 avg

l2arc_write_buffers: reduce headroom value

FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

headroom determines how much data is scanned on a single list
during each run of the l2arc feed thread.
Because FreeBSD has more lists we proportionally decrease the limit.

Reviewed by: Brendan Gregg (earlier version)
MFC after: 2 weeks
Sponsored by: HybridCluster


272707 07-Oct-2014 avg

revert r272702: wrong (earlier) change was committed


272702 07-Oct-2014 avg

reduce L2ARC_WRITE_SIZE on FreeBSD

FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which
defines limits on how much data is scanned and written to a cache device
during each run of the l2arc feed thread. The limits are applied on the
per buffer list basis.
Because FreeBSD has more lists we proportionally reduce the limits.

Reviewed by: Brendan Gregg (earlier version)
MFC after: 2 weeks
Sponsored by: HybridCluster


272601 06-Oct-2014 delphij

MFV r272591:

Use loaned ARC buffer for zfs receive to avoid copy.

Illumos issue:
5162 zfs recv should use loaned arc buffer to avoid copy

MFC after: 2 weeks


272598 06-Oct-2014 delphij

MFV r272585:

Split the godfather zio into CPU number's to reduce lock
contention.

Illumos issue:
5176 lock contention on godfather zio

MFC after: 2 weeks


272584 06-Oct-2014 delphij

MFV r272501:

Illumos issue:
5177 remove dead code from dsl_scan.c

MFC after: 2 weeks


272583 06-Oct-2014 delphij

MFV r272500:

Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT
when cloning. This prevents DS_FLAG_DEFER_DESTROY being inherited from a
clone that is marked for deferred destroy, which causes snapshots of the
clone being destroyed when getting a hold or clone.

Illumos issue:
5150 zfs clone of a defer_destroy snapshot causes strangeness

MFC after: 1 week


272527 04-Oct-2014 delphij

Don't make nested definition for range_seg_cache.

Reported by: ian
MFC after: 1 week
X-MFC-With: r272506


272511 04-Oct-2014 delphij

MFV r272499:

Illumos issue:
5174 add sdt probe for blocked read in dbuf_read()

MFC after: 2 weeks


272510 04-Oct-2014 delphij

Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system
administrator to toggle whether ZFS should ignore UNMAP requests.

Illumos issue:
5149 zvols need a way to ignore DKIOCFREE

MFC after: 2 weeks


272509 04-Oct-2014 delphij

Diff reduction with upstream. The code change is not really applicable
to FreeBSD.

Illumos issue:
5148 zvol's DKIOCFREE holds zfsdev_state_lock too long

MFC after: 1 month


272507 04-Oct-2014 delphij

MFV r272496:

Add tunable for number of metaslabs per vdev
(vfs.zfs.vdev.metaslabs_per_vdev). The default remains
at 200.

Illumos issue:
5161 add tunable for number of metaslabs per vdev

MFC after: 2 weeks


272506 04-Oct-2014 delphij

MFV r272495:

In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in
response of memory pressure.

Illumos issue:
5163 arc should reap range_seg_cache

MFC after: 1 week


272504 04-Oct-2014 delphij

MFV r272494:

Make space_map_truncate() always do space_map_reallocate(). Without
this, setting space_map_max_blksz would cause panic for existing pool,
as dmu_objset_set_blocksize would fail if the object have multiple blocks.

Illumos issues:
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
spacemap_histogram feature

MFC after: 2 weeks


272483 03-Oct-2014 smh

Refactor ZFS ARC reclaim checks and limits

Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay


272474 03-Oct-2014 smh

Fix various issues with zvols

When performing snapshot renames we could deadlock due to the locking
in zvol_rename_minors. In order to avoid this use the same workaround
as zvol_open in zvol_rename_minors.

Add missing zvol_rename_minors to dsl_dataset_promote_sync.

Protect against invalid index into zv_name in zvol_remove_minors.

Replace zvol_remove_minor calls with zvol_remove_minors to ensure
any potential children are also renamed.

Don't fail zvol_create_minors if zvol_create_minor returns EEXIST.

Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we
don't call zvol_remove_minors when zfs_unmount_snap fails.

PR: 193803
MFC after: 1 week
Sponsored by: Multiplay


272467 03-Oct-2014 araujo

Fix failures and warnings reported by newpynfs20090424 test tool.
This fix addresses only issues with the pynfs reports, none of these
issues are know to create problems for extant real clients.

Submitted by: Bart Hsiao <bart.hsiao@gmail.com>
Reworked by: myself
Reviewed by: rmacklem
Approved by: rmacklem
Sponsored by: QNAP Systems Inc.


272359 01-Oct-2014 will

zfsvfs_create(): Refuse to mount datasets whose names are too long.

This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently
still in projects/zfsd rather than head).

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
- zfsvfs_create(): Check whether the objset name fits into
statfs.f_mntfromname, and return ENAMETOOLONG if not. Although
the filesystem can be unmounted via the umount(8) command, any
interface that relies on iterating on statfs (e.g. libzfs) will
fail to find the filesystem by its objset name, and thus assume
it's not mounted. This causes "zfs unmount", "zfs destroy",
etc. to fail on these filesystems, whether or not -f is passed.

MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 974872 on 2013/08/09


272324 30-Sep-2014 delphij

Fix a mismerge in r260183 which prevents snapshot zvol devices being
removed and re-instate the fix in r242862.

Reported by: Leon Dang <ldang nahannisys com>, smh
MFC after: 3 days


271798 18-Sep-2014 will

Remove debug.zfs_flags in favor of the new vfs.zfs.debug_flags.
Replace TUNABLE_INT with CTLFLAG_RWTUN.

Submitted by: avg (debug.zfs_flags removal), smh (TUNABLE_INT replacement)


271788 18-Sep-2014 will

Enable ZFS debug flags to be modified via vfs.zfs.debug_flags.

This is primarily only of interest to ZFS developers, but it makes it
easier to get additional debugging.

Submitted by: gibbs
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)


271785 18-Sep-2014 will

Reorder sysctls for spa.c global tunables; add sysctl for ccw_retry_interval.

MFC after: 1 month
Sponsored by: Spectra Logic


271781 18-Sep-2014 will

bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.

If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires. This call
does not invoke any calls to bpobj_close() itself.

This bug doesn't have any known vector, but was found on inspection.

MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD: r1050998 on 2014/03/26


271754 18-Sep-2014 smh

Remove unused ZFS ARC functions

* arc_data_buf_alloc
* arc_data_buf_free

MFC after: 1 week
Sponsored by: Multiplay


271589 14-Sep-2014 smh

Added missing ZFS sysctls
* vfs.zfs.vdev.async_write_active_min_dirty_percent
* vfs.zfs.vdev.async_write_active_max_dirty_percent

Added validation of min / max for ZFS sysctl
* vfs.zfs.dirty_data_max_percent

MFC after: 3 days


271536 13-Sep-2014 delphij

MFV r271518:

Correctly report hole at end of file.

When asked to find a hole, the DMU sees that there are no holes in the
object, and returns ESRCH. The ZPL interprets this as "no holes before
the end of the file", and therefore inserts the "virtual hole" at the
end of the file. Because DMU and ZPL have different ideas of where the
end of an object/file is, we will end up returning the end of file,
which is generally larger, instead of returning the end of object.

The fix is to handle the "virtual hole" in the DMU. If no hole is found,
the DMU will return a hole at the end of the file, rather than an error.

Illumos issue:
5139 SEEK_HOLE failed to report a hole at end of file

MFC after: 1 week


271534 13-Sep-2014 delphij

MFV r271517:

In zil_claim, don't issue warning if we get EBUSY (inconsistent) when
opening an objset, instead, ignore it silently.

Illumos issue:

5140 message about "%recv could not be opened" is printed when booting after crash

MFC after: 1 week


271532 13-Sep-2014 delphij

MFV r271515:

Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to
limit how many blocks can be free'ed before a new transaction group is
created. The default is no limit (infinite), but we should probably have
a lower default, e.g. 100,000.

With this limit, we can guard against the case where ZFS could run out of
memory when destroying large numbers of blocks in a single transaction
group, as the entire DDT needs to be brought into memory.

Illumos issue:
5138 add tunable for maximum number of blocks freed in one txg

MFC after: 2 weeks


271528 13-Sep-2014 delphij

MFV r271512:

Illumos issue:
5136 fix write throttle comment in dsl_pool.c

MFC after: 2 weeks


271526 13-Sep-2014 delphij

MFV r271510:

Enforce 4K as smallest indirect block size (previously the smallest
indirect block size was 1K but that was never used).

This makes some space estimates more accurate and uses less memory
for some data structures.

Illumos issue:
5141 zfs minimum indirect block size is 4K

MFC after: 2 weeks


271429 11-Sep-2014 smh

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

MFC after: 1 week


271387 10-Sep-2014 glebius

Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().


271308 09-Sep-2014 mav

Make ZVOL writes in device mode support IO_SYNC flag.

MFC after: 1 month


271226 07-Sep-2014 delphij

MFV r271223:

In dnode_sync(), do dnode_increase_indirection() before processing
the dn_next_nblkptr.

Illumos issue:
5117 space map reallocation can cause corruption

MFC after: 3 days


270871 31-Aug-2014 peter

Move the restored #ifdef i386 test back inside the #ifdef _KERNEL block
where it originally was.


270861 30-Aug-2014 smh

Ensure that ZFS ARC free memory checks include cached pages

Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider

PR: 187594
Reviewed by: peter
X-MFC-With: r270759
Review: D700
Sponsored by: Multiplay


270834 30-Aug-2014 mjg

Add missing proctree locking to fill_kinfo_proc consumers.

This fixes r270444.

Pointy hat: mjg
Reported by: many
MFC after: 1 week


270759 28-Aug-2014 smh

Refactor ZFS ARC reclaim logic to be more VM cooperative

Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay


270383 22-Aug-2014 delphij

Instead of using timestamp in the AVL, use the memory address when
comparing.

Illumos issue:
5095 panic when adding a duplicate dbuf to dn_dbufs

MFC after: 3 days


270382 22-Aug-2014 delphij

MFV r270197:

Illumos issue:
5066 remove support for non-ANSI compilation
5068 Remove SCCSID() macro from <macros.h>

MFC after: 2 weeks


270248 20-Aug-2014 delphij

MFV r270196:

Illumos issue:
5047 don't use atomic_*_nv if you discard the return value

MFC after: 2 weeks


270247 20-Aug-2014 delphij

MFC r270195:

Illumos issue:
5045 use atomic_{inc,dec}_* instead of atomic_add_*

MFC after: 2 weeks


270239 20-Aug-2014 delphij

MFV r270193:

Illumos issues:
5042 stop using deprecated atomic functions

MFC after: 2 weeks


269543 05-Aug-2014 delphij

MFV r269542:

In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation. This fixes a panic when removing log device.

Illumos issue:
5049 panic when removing log device

Author: Alex Reece <alex@delphix.com>
MFC after: 2 weeks


269525 04-Aug-2014 markj

Return 0 for the PPID of threads in process 0, as process 0 doesn't have a
parent process.

MFC after: 2 weeks


269466 03-Aug-2014 delphij

Revert r269404 and use cpu_ticks() for dbuf allocation.

Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().

Reviewed by: mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after: 2 weeks


269431 02-Aug-2014 delphij

MFV r269427:

In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.

This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.

Illumos issue:
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>

MFC after: 2 weeks


269407 01-Aug-2014 smh

Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods

This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.

We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.

zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.

Cleanup flow in vdev_geom_io_start while I'm here.

Also fix some cases not using SET_ERROR(..)

MFC after: 2 weeks
X-MFC-With: r265321


269230 29-Jul-2014 delphij

MFV r269224:

Increase default ARC buf_hash_table size. When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.

A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).

Illumos issue:
5034 ARC's buf_hash_table is too small

MFC after: 2 weeks


269229 29-Jul-2014 delphij

MFV r269223:

Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets

MFC after: 2 weeks


269222 29-Jul-2014 delphij

Reschedule the 'deadman' callout after handling, this makes our
code behave more like it is on Solaris.

Reported by: avg
Reviewed by: avg, mav (but bugs are mine)

Differential Revision: https://phabric.freebsd.org/D457


269189 28-Jul-2014 kib

Initialize zfs vnode v_hash when the vnode is allocated, instead of
postponing it to zfs_vget(). zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().

Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.

Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


269138 26-Jul-2014 delphij

Add two sysctls for newly added tunables.

MFC after: 2 weeks


269118 26-Jul-2014 delphij

MFV r269010:

Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric

MFC after: 2 weeks


269117 26-Jul-2014 mav

Make sysctls under vfs.zfs.zfetch writeable.

I don't see any reason for them to be read-only, while tuning them without
reboot is much more convenient for experiments.

MFC after: 2 weeks


269093 25-Jul-2014 delphij

Transform the I/O when vdev_physical_ashift is greater than
SPA_MINBLOCKSHIFT.

MFC after: 2 weeks


269086 25-Jul-2014 delphij

As of r268075, the responsibility of rounding up buffer to optimal size have
been transferred from zio_compress_data to its caller. Therefore, passing
the 'minblocksize' down will be a no-op.

Eliminate the parameter to reduce diff against upstream.

MFC after: 2 weeks


268980 22-Jul-2014 delphij

Correct typo introduced with r268855.

MFC after: 10 days
X-MFC with: r268855


268865 19-Jul-2014 delphij

Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.

Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.

Illumos issue:
5008 lock contention (rrw_exit) while running a read only load

MFC after: 2 weeks


268859 18-Jul-2014 delphij

MFV r268851:

When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).

Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting

MFC after: 2 weeks


268858 18-Jul-2014 delphij

MFV r268850:

Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.

The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.

Illumos issue:
4631 zvol_get_stats triggering too many reads

MFC after: 2 weeks


268855 18-Jul-2014 delphij

MFV r268848:

Instead of asserting all zio's be properly aligned, only assert
on the logical ones.

Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.

This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).

While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.

Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe

MFC after: 2 weeks


268720 15-Jul-2014 delphij

MFV r268714:

Improve extreme rewind import.

When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.

For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:

vfs.zfs.spa_load_verify_maxinflight
vfs.zfs.spa_load_verify_metadata
vfs.zfs.spa_load_verify_data

The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.

Make 'zpool import -T' imply scrub.

Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.

Skip txg's for which there is no uberblock when doing extreme rewind.

Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.

Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice

MFC after: 2 weeks


268713 15-Jul-2014 delphij

MFV r268702:

Add missing *_destroy() calls in various places with ZFS.

Illumos issue:
4975 missing mutex_destroy() calls in zfs

MFC after: 2 weeks


268473 09-Jul-2014 delphij

MFV r268455:

Use reserved space for ZFS administrative commands.

We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.

Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.

A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.

MFC after: 2 weeks


268464 09-Jul-2014 delphij

MFV r268452:

Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue: 4950 files sometimes can't be removed from a full
filesystem
MFC after: 2 weeks


268420 08-Jul-2014 mav

Remove IO_SYNC flag when writing extended file attributes on ZFS.

While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS. UFS does not do sync there too.

Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.

Reviewed by: delphij, jpaetzel, silence on fs@
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


268351 07-Jul-2014 marcel

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


268290 05-Jul-2014 pfg

Merge from OpenSolaris (24-Jul-2010):

6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink

This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.

MFC after: 1 week


268231 03-Jul-2014 pfg

Merge from OpenSolaris (30-Jun-2009):

6851093 system drops to kmdb with anonymous dtrace probes + kmdb

This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.

MFC after: 1 week


268230 03-Jul-2014 pfg

Merge from OpenSolaris (22-Apr-2008):

6823388 DTrace ioctl handlers must validate all structure members

MFC after: 1 week


268224 03-Jul-2014 pfg

Merge from OpenSolaris (20-Apr-2008):

6822482 DOF validation needs to handle loadable sections flagged as unloadable

MFC after: 1 week


268178 02-Jul-2014 mav

Fix bug in sync control in new "dev" mode of ZVOL (r265678).

Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".

MFC after: 3 days


268130 01-Jul-2014 pfg

Merge from OpenSolaris (15-Sep-2008):

6735480 race between probe enabling and provider registration

MFC after: 1 week


268128 01-Jul-2014 delphij

MFV r268122:

4929 want prevsnap property

illumos/illumos-gate@b461c7460e5e77cf65f00151162e654220c6e2fb

MFC after: 2 weeks


268126 01-Jul-2014 delphij

MFV r268121:

4924 LZ4 Compression for metadata

illumos/illumos-gate@b8289d24d866c1af02d7007348f7f057693c15d3

MFC after: 2 weeks


268125 01-Jul-2014 pfg

Small merges from OpenSolaris:

These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:

6699767 panic in spec_open()

6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone

MFC after: 3 days


268123 01-Jul-2014 delphij

MFV r268119:

4914 zfs on-disk bookmark structure should be named *_phys_t

illumos/illumos-gate@7802d7bf98dec568dadf72286893b1fe5abd8602

MFC after: 2 weeks


268116 01-Jul-2014 delphij

- Fix handling of "new" style of ioctl in compatiblity mode [1];
- Reorganize code and reduce diff from upstream;
- Improve forward compatibility shims for previous kernel;

Reported by: sbruno [1]
X-MFC-With: r268075


268097 01-Jul-2014 pfg

Revert r268007, and re-adapt MFV r260708:

4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c required -Wno-cast-qual for powerpc.

MFC after: 2 weeks


268086 01-Jul-2014 delphij

MFV r267570:

4756 metaslab_group_preload() could deadlock

illumos/illumos-gate@30beaff42d8240ebf5386e8b7a14e3d137a1631f

MFC after: 2 weeks


268085 01-Jul-2014 delphij

MFV r267569:

4897 Space accounting mismatch in L2ARC/zpool

illumos/illumos-dist@3038a2b421b40dc5ac11cd88423696618584f85a

MFC after: 2 weeks


268082 01-Jul-2014 delphij

MFV r267567:

4881 zfs send performance degradation when embedded block pointers are
encountered

illumos/illumos-gate@06315b795c0d54f0228e0b8af497a28752dd92da

MFC after: 2 weeks


268079 01-Jul-2014 delphij

MFV r267566:

4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after: 2 weeks


268075 01-Jul-2014 delphij

MFV r267565:

4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

MFC after: 2 weeks


268007 28-Jun-2014 pfg

Revert r267869:

MFV r260708
4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c broke the build on powerpc.

Reported by: bz, rpaulo and tinderbox.
Pointyhat: me


267992 28-Jun-2014 hselasky

Pull in r267961 and r267973 again. Fix for issues reported will follow.


267985 27-Jun-2014 gjb

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


267961 27-Jun-2014 hselasky

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


267942 26-Jun-2014 rpaulo

MFV illumos

4471 DTrace count() with histogram
4472 DTrace full width distribution histograms
4473 DTrace frequency trails

MFC after: 2 weeks


267941 26-Jun-2014 rpaulo

MFV illumos

4474 DTrace Userland CTF Support
4475 DTrace userland Keyword
4476 DTrace tests should be better citizens
4479 pid provider types
4480 dof emulation is missing checks

MFC after: 2 weeks


267937 26-Jun-2014 rpaulo

MFV illumos

4477 DTrace should speak JSON

MFC after: 2 weeks


267929 26-Jun-2014 rpaulo

MFV illumos r266986:

2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access

MFC after: 2 weeks


267925 26-Jun-2014 rpaulo

Revert r267898.


267898 26-Jun-2014 rpaulo

Bring the following change from the illumos-joyent repository:

commit 78e24ab6803bbe11ba37642624e1498ede5b239d
Author: Bryan Cantrill <bryan@joyent.com>
Date: Thu Oct 31 01:20:54 2013

OS-1688 DTrace count() with histogram
OS-2360 DTrace full width distribution histograms
OS-2361 DTrace frequency trails

MFC after: 2 weeks


267869 25-Jun-2014 pfg

MFV r260708

4427 pid provider rejects probes with valid UTF-8 names

This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.

Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5

Reference:
https://www.illumos.org/issues/4427

Obtained from: Illumos
MFC after: 2 weeks


267851 25-Jun-2014 davide

Continue the crusade towards a dev_clone()-free kernel, removing its
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.

Reviewed by: markj
Approved by: markj
MFC after: never


267761 23-Jun-2014 markj

Fix some bugs when fetching probe arguments in i386. Firstly ensure that
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.

MFC after: 3 weeks


267267 09-Jun-2014 smh

Removed stale comment about multi-vdev root pool config not working

MFC after: 1 week


267038 04-Jun-2014 bdrewery

- Naively fix build by partially reverting r267029 to still use
gethrtime() when building libzpool.

X-MFC-With: 267029


267029 03-Jun-2014 mav

Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
selection. gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.

MFC after: 2 weeks


266915 31-May-2014 delphij

MFV 266913+266914:

3897 zfs filesystem and snapshot limits (fix leak)
4901 zfs filesystem/snapshot limit leaks

MFC after: 3 days


266771 27-May-2014 delphij

MFV r266766:

Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after: 2 weeks


266533 22-May-2014 allanjude

Improve sysctl descriptions for new ZFS sysctls:
vfs.zfs.dirty_data_max
vfs.zfs.dirty_data_max_max
vfs.zfs.dirty_data_sync

Reviewed by: smh
Approved by: wblock (mentor)


266497 21-May-2014 smh

Added sysctls / tunables for ZFS dirty data tuning

Added the following new sysctls / tunables:
* vfs.zfs.dirty_data_max
* vfs.zfs.dirty_data_max_max
* vfs.zfs.dirty_data_max_percent
* vfs.zfs.dirty_data_sync
* vfs.zfs.delay_min_dirty_percent
* vfs.zfs.delay_scale

PR: kern/189865
MFC after: 2 weeks


265458 06-May-2014 delphij

Import George Wilson's change for Illumos #4730:

4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>

Original author: George Wilson

MFC after: 3 days


265321 04-May-2014 smh

Use a zio flag to prevent recursion of vdev_queue_io_done which can
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.

This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.

MFC after: 1 month
X-MFC-With: r265152


265253 03-May-2014 smh

Don't treat TRIM requests returning ENOTSUP as an unexpected error.

MFC after: 1 month
X-MFC-With: r265152


265218 02-May-2014 smh

Removed pointless / duplicated call to trim_map_first.

MFC after: 1 month
X-MFC-With: r265152


265152 30-Apr-2014 smh

Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority

The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after: 1 month


265046 28-Apr-2014 smh

Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.

This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.

MFC after: 2 days


264885 24-Apr-2014 smh

Eliminate duplicate checks in vdev_geom_io_intr error handling

MFC after: 1 month


264850 24-Apr-2014 smh

Add the ability to set a minimum ashift size for ZFS pool creation or root level
vdev addition.

Change max_auto_ashift sysctl to error when an invalid value is requested instead
of silently limiting it.


264836 23-Apr-2014 delphij

MFV r264830:

4745 fix AVL code misspellings

MFC after: 2 weeks


264835 23-Apr-2014 delphij

MFV r264829:

3897 zfs filesystem and snapshot limits

MFC after: 2 weeks


264671 18-Apr-2014 delphij

MFV r264668:

4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

illumos/illumos@b6240e830b871f59c22a3918aebb3b36c872edba

MFC after: 2 weeks


264670 18-Apr-2014 delphij

MFV r264667:

4752 fan out read zio taskqs

illumos/illumos-gate@1b497ab83e8f1c58bba5da59c649207a442a4720


264669 18-Apr-2014 delphij

MFV r264666:

4374 dn_free_ranges should use range_tree_t

illumos/illumos-gate@bf16b11e8deb633dd6c4296d46e92399d1582df4

MFC after: 2 weeks


264434 14-Apr-2014 markj

DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.

In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.

This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.

With this change it's possible to trace all libc functions in a program,
e.g. with

pid$target:libc.so.*::entry {@[probefunc] = count();}

Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.

Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after: 6 weeks


264392 13-Apr-2014 davide

Fix a panic in zfs_rename().
this is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.

While at it, remove the usage of VOP_REALVP, as long as this is a NOP
on FreeBSD.

Discussed with: avg
Reviewed by: pjd


264341 11-Apr-2014 mav

Create zvol devices on zfs clone.

While big and shiny patch is not ready, it is better to have something.

PR: kern/178999
MFC after: 1 week


264193 06-Apr-2014 mav

In addition to r264077, tell GEOM that we do support BIO_DELETE now.


264145 05-Apr-2014 mav

Add property and sysctl to control how ZVOLs are exposed to OS.

New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.

MFC after: 1 month
Sponsored by: iXsystems, Inc.


264086 03-Apr-2014 mav

MFV r258922:
3580 Want zvols to return volblocksize when queried for physical block size

illumos/illumos-gate@a0b60564dfc644f4bfaef1ce26d343b44cf68bc5

It is irrelevant for FreeBSD, just reducing diff.


264077 03-Apr-2014 mav

Add BIO_DELETE support to ZVOL.

It is an adapted merge from the vendor branch of:
701 UNMAP support for COMSTAR (in part related to ZFS)
2130 zvol DKIOCFREE uses nested DMU transactions


264040 02-Apr-2014 pfg

MFV r258379;

4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included

Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59

Reference:
https://www.illumos.org/issues/4248
https://www.illumos.org/issues/4249

Obtained from: Illumos
MFC after: 1 month


263620 22-Mar-2014 bdrewery

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


263118 13-Mar-2014 mav

Report ZVOL block size as GEOM stripesize.

MFC after: 2 weeks


262990 11-Mar-2014 delphij

MFV r262983:

4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!

illumos/illumos-gate@2144b121c08e0eb676cc6ca4662ebbc9f9c22fe3


262676 02-Mar-2014 delphij

All callers of static method load_nvlist() in spa.c handles error case,
so there is no reason to assert that we won't hit an error. Instead,
just return that error to caller and have the upper layer handle it.

Obtained from: FreeNAS
Reported by: rodrigc
Reviewed by: Matthew Ahrens
MFC after: 2 weeks


262665 01-Mar-2014 markj

Expose a few DTrace parameters as sysctls under kern.dtrace and add
descriptions for several existing sysctls.

PR: 187027
Submitted by: Fedor Indutny <fedor@indutny.com> (original version)
MFC after: 2 weeks


262661 01-Mar-2014 markj

Fix emulation of call and jmp instructions on i386 and for 32-bit processes
on amd64.

Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after: 2 weeks


262596 28-Feb-2014 markj

4478 dtrace_dof_maxsize is far too small

illumos/illumos-gate@d339a29bb4765c4b6883a935cf69b669cd05bca0

PR: 187027
MFC after: 1 week


262542 27-Feb-2014 markj

Move some files that are identical on i386 and amd64 to an x86 subdirectory
rather than keeping duplicate copies.

Discussed with: avg
MFC after: 1 week


262330 22-Feb-2014 markj

1452 DTrace buffer autoscaling should be less violent

illumos/illumos-gate@6fb4854bed54ce82bd8610896b64ddebcd4af706

This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.

MFC after: 1 month


261620 08-Feb-2014 delphij

MFV r261619:

4574 get_clones_stat does not call zap_count in non-debug kernel

zap_count(...) is never called in non-DEBUG kernel.
As result "count" variable is always 0, and "goto fail" is always
reached. This means get_clones_stat function never makes up list
of clones for "clones" properties.

MFC after: 2 weeks


260835 18-Jan-2014 delphij

MFV r260834:

Fix memory leak of compressed buffers in l2arc_write_done (Illumos
#3995).


260812 17-Jan-2014 avg

traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT

This is done to ensure that visited object IDs are always increasing.
Also, pass correct object ID to prefetch_dnode_metadata for
os_groupused_dnode.

Without this change we would hit an assert if traversal was paused on
a GROUPUSED object, which is unlikely but possible.

Apparently the same change was independently developed by Deplhix.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
MFC after: 10 days
Sponsored by: HybridCluster


260717 16-Jan-2014 avg

fix a build problem with INVARIANTS enabled introduced in r260704

Reported by: glebius
MFC after: 5 days
X-MFC with: r260704


260713 16-Jan-2014 avg

fix a bug in ZFS mirror code for handling multiple DVAa

The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster


260711 16-Jan-2014 avg

Revert r260705: wrong patch committed by accident

An earlier, less efficient version was committed by accident.


260706 16-Jan-2014 avg

zfs_deleteextattr: name buffer from namei is needed by zfs_rename

If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.

MFC after: 6 days
Sponsored by: HybridCluster


260705 16-Jan-2014 avg

fix a bug in ZFS mirror code for handling multiple DVAa

The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster


260704 16-Jan-2014 avg

zfs: getnewvnode_reserve must be called outside of a zfs transaction

Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from be quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
a be reclaimed. vnlru thread is blocked trying to enter ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.

zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context. Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.

While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.

Reported by: adrian
Tested by: adrian
MFC after: 17 days
Sponsored by: HybridCluster


260236 03-Jan-2014 mav

In dmu_zfetch_stream_reclaim() replace division with multiplication and
move it out of the loop and lock.


260185 02-Jan-2014 delphij

MFV r260155:

When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark. This would lead to double free of bp's which will lead to
space map corruption.

Instead of tolerating and allowing the corruption, panic immediately.

See Illumos #4390 for more details.

4391 panic system rather than corrupting pool if we hit bug 4390

Illumos/illumos-gate@8b36997aa24d9817807faa4efa301ac9c07a2b78

MFC after: 2 weeks


260183 02-Jan-2014 delphij

MFV r260154 + 260182:

4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools

Illumos/illumos-gate@78f171005391b928aaf1642b3206c534ed644332

MFC after: 2 weeks


260181 02-Jan-2014 delphij

Fix build on platforms where atomic_swap_64 is not available.


260157 01-Jan-2014 delphij

MFV r260153:

4121 vdev_label_init should treat request as succeeded when pool
is read only

Illumos/illumos-gate@973c78e94bf9634782164382c9e291bf81161fa5

MFC after: 2 weeks


260150 01-Jan-2014 delphij

MFV r259170:

4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47bfcd2ad9bf501faec8e75c08095e4f

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.

MFC after: 2 weeks


260141 31-Dec-2013 delphij

MFV r258385:

(Note: this change is not applicable to FreeBSD and the file
is not included in build. It's integrated for completeness).

4128 disks in zpools never go away when pulled

illumos/illumos-gate@39cddb10a31c1c2e66aed69e6871d09caa4c8147

MFC after: 2 weeks


260138 31-Dec-2013 delphij

MFV r242733:

3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message

illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1

FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use. vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.

MFC after: 2 weeks


260131 31-Dec-2013 markj

Revert r260091. The vmem calls seem to be slower than the *_unr() calls that
they replaced, which is important considering that probe IDs are allocated
during process startup for USDT probes.


260091 30-Dec-2013 markj

Now that vmem(9) is available, use vmem arenas to allocate probe and
aggregation IDs, as is done in the upstream illumos code. This still
requires some FreeBSD-specific code, as our vmem API is not identical to the
one in illumos.

Submitted by: Mike Ma <mikemandarine@gmail.com>


259813 24-Dec-2013 delphij

MFV r258374:

4171 clean up spa_feature_*() interfaces

4172 implement extensible_dataset feature for use by other zpool
features

illumos/illumos-gate@2acef22db7808606888f8f92715629ff3ba555b9

MFC after: 2 weeks


259811 24-Dec-2013 delphij

MFV r258373:

4168 ztest assertion failure in dbuf_undirty

4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state

illumos/illumos-gate@7fdd916c474ea52896c671bbe7b56ba34a1ca132

MFC after: 2 weeks


259576 18-Dec-2013 pjd

MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0

illumos/illumos-gate@bb411a08b05466bfe0c7095b6373bbc1587e259a

MFC after: 3 days


259535 18-Dec-2013 markj

The fasttrap fork handler is responsible for removing tracepoints in the
child process that were inherited from its parent. However, this should
not be done in the case of a vfork, since the fork handler ends up removing
the tracepoints from the shared vm space, and userland DTrace probes in the
parent will no longer fire as a result.

Now the child of a vfork may trigger userland DTrace probes enabled in its
parent, so modify the fasttrap probe handler to handle this case and handle
the child process in the same way that it would handle the traced process.
In particular, if once traces function foo() in a process that vforks, and
the child calls foo(), fasttrap will treat this call as having come from the
parent. This is the behaviour of the upstream code.

While here, add #ifdef guards to some code that isn't present upstream.

MFC after: 1 month


259240 12-Dec-2013 asomers

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.

Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.

Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.

Approved by: ken (mentor)
Sponsored by: Spectra Logic Corporation
MFC after: 4 weeks


259211 11-Dec-2013 markj

Correct the check for errors from proc_rwmem().

MFC after: 2 weeks


259168 10-Dec-2013 mav

Don't even try to read vdev labels from devices smaller then SPA_MINDEVSIZE
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.


259052 06-Dec-2013 delphij

Expose spa_asize_inflation.

X-MFC-With: r258632


258746 29-Nov-2013 avg

zfs: add zfs_freebsd_putpages

this should be more optimal than writing pages one-by-one via zfs_write ->
update_pages in the case of multi-page putpages call

MFC after: 16 days


258745 29-Nov-2013 avg

zfs: add dmu_write_pages variant for freebsd

The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for ZFS userland counterpart.

MFC after: 15 days


258744 29-Nov-2013 avg

zfs: make zfs_map_page / zfs_unmap_page public

MFC after: 15 days


258743 29-Nov-2013 avg

drop ZUT_OBJ, zfs unit testing driver never materialzied in freebsd

MFC after: 5 days


258739 29-Nov-2013 avg

zfs mappedread_sf: assert that a page is never partially valid

ZFS never partially validates or invalidates a page.
The higher level VM should not do that either.
mappedread_sf correct operation depends on a page being either fully
valid or invalid.

MFC after: 7 days


258720 28-Nov-2013 avg

MFV r258665: 4347 ZPL can use dmu_tx_assign(TXG_WAIT)

illumos/illumos-gate@e722410c49fe67cbf0f639cbcc288bd6cbcf7dd1

MFC after: 9 days
X-MFC after: r258632


258717 28-Nov-2013 avg

MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control

4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab

illumos/illumos-gate@0713e232b7712cd27d99e1e935ebb8d5de61c57d

Note that some tunables have been removed and some new tunables have
been added. Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).

MFC after: 11 days
Sponsored by: HybridCluster [merge]


258704 28-Nov-2013 avg

fix a serious bug in r258632: offset parameter must be set in zio

In illumos all ioctl zio-s are "global" at the moment. That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.

Pointyhat to: avg
Tested by: sbruno
MFC after: 3 days
X-MFC with: r258632
Sponsored by: HybridCluster


258642 26-Nov-2013 avg

fix debug.zfs_flags sysctl description in r258638

Pointyhat to: avg
MFC after: 3 days


258638 26-Nov-2013 avg

expose zfs_flags as debug.zfs_flags r/w tunable and sysctl

This knob is purposefully hidden under debug.

MFC after: 5 days
Sponsored by: HybridCluster


258634 26-Nov-2013 avg

MFV r258376: 3964 L2ARC should always compress metadata buffers

illumos/illumos-gate@e4be62a2b74a8f09bb669217a1a39eee069b13a1

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258633 26-Nov-2013 avg

MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit

4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold

illumos/illumos-gate@22e30981d82a0b6dc89253596ededafae8655e00

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258632 26-Nov-2013 avg

MFV r255255: 4045 zfs write throttle & i/o scheduler performance work

illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet

MFC after: 10 days
Sponsored by: HybridCluster [merge]


258631 26-Nov-2013 avg

MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot

illumos/illumos-gate@ec94d32216ed5705f5176582355cc311cf848e73

MFC after: 9 days
Sponsored by: HybridCluster [merge]


258630 26-Nov-2013 avg

734 taskq_dispatch_prealloc() desired

943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e3be0c51e86f73096334611ae3a058e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after: 8 days


258628 26-Nov-2013 avg

opensolaris taskq: some cosmetic changes

- drop trailing whitespace
- remove redundant "extern" from function declarations
- remove unused macro

MFC after: 1 week


258597 25-Nov-2013 pjd

When append-only, immutable or read-only flag is set don't allow for
hard links creation. This matches UFS behaviour.

Reported by: Oleg Ginzburg <olevole@olevole.ru>
MFC after: 1 month


258389 20-Nov-2013 avg

MFV r258378: 4089 NULL pointer dereference in arc_read()

illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea

Tested by: adrian
MFC after: 4 days


258388 20-Nov-2013 avg

MFV r258377: 4088 use after free in arc_release()

illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

MFC after: 5 days


258353 19-Nov-2013 avg

zfs page_busy: fix the boundaries of the cleared range

This is a fix for a regression introduced in r246293.

vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them. Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.

It would interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces. Perhaps instead we should clear the whole page
when it is completely overwritten and don't bother clearing any bits
if only a portion a page is written.

Reported by: George Hartzell <hartzell@alerce.com>,
Richard Todd <rmtodd@servalan.servalan.com>
Tested by: George Hartzell <hartzell@alerce.com>,
Reviewed by: kib
MFC after: 5 days


258342 19-Nov-2013 mav

Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.

On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage. It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.


258311 18-Nov-2013 asomers

opensolaris/uts/common/dtrace/fasttrap.c
Fix several problems that can cause panics on kldload and kldunload.

* kproc_create(fasttrap_pid_cleanup_cb, ...) gets called before
fasttrap_provs.fth_table gets allocated. This can lead to a panic
on module load, because fasttrap_pid_cleanup_cb references
fasttrap_provs.fth_table. Move kproc_create down after the point
that fasttrap_provs.fth_table gets allocated, and modify the error
handling accordingly.

* dtrace_fasttrap_{fork,exec,exit} weren't getting NULLed until
after fasttrap_provs.fth_table got freed. That caused panics on
module unload because fasttrap_exec_exit calls
fasttrap_provider_retire, which references
fasttrap_provs.fth_table. NULL those function pointers earlier.

* There wasn't any code to destroy the
fasttrap_{tpoints,provs,procs}.fth_table mutexes on module unload,
leading to a resource leak when WITNESS is enabled. Destroy those
mutexes during fasttrap_unload().

Reviewed by: markj
Approved by: ken (mentor)
Sponsored by: Spectra Logic
MFC after: 4 weeks


258294 18-Nov-2013 smh

Fix ZFS deadlock when sending a snapshot which is mounted.

MFC after: 1 week
Sponsored by: Multiplay


258291 18-Nov-2013 markj

The fasttrap ioctl used to create probes takes a variable-sized argument.
It was not being correctly copied into the kernel on FreeBSD, and as a
result, probes with multiple probe sites were not being created properly.
To fix this, change the ioctl definition so that the fasttrap ioctl handler
is responsible for copying in userland data.

Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after: 1 month


258137 14-Nov-2013 mav

Introduce allocation cache to store LZ4 compression contexts without kicking
VM subsystem twice for every written record.

Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.

This patch is not really needed on illumos (while not harm either) since
their memory allocator by default uses caching for all requests up to 128K.

Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>


257679 05-Nov-2013 markj

Use suword32 and suword64 instead of copyout(9). This fixes a bug in the
emulation of the call instruction caused by reversing the uaddr and kaddr
arguments when copying data out to userland: the suword* functions take the
uaddr as the first argument whereas copyout(9) takes the kaddr as the first
argument. This also partially undoes the fixes from r257143.

Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after: 1 month


257143 26-Oct-2013 markj

Fix a couple of bugs in the fasttrap emulation of a "push %rbp" instruction:
the code was trying to save the stack pointer rather than the frame pointer,
and the arguments to copyout(9) were reversed, so nothing ended up being
saved on the stack. This would cause process crashes when the pid provider
was being used to instrument calls of a function starting with this
instruction.

Reported by: symbolics@gmx.com
Tested by: symbolics@gmx.com (earlier version)
MFC after: 2 weeks


256956 23-Oct-2013 smh

Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay


256889 22-Oct-2013 smh

Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.

This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.

MFC after: 2 days


256880 22-Oct-2013 mav

Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.

Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.

This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by: iXsystems, Inc.
MFC after: 2 months


256822 21-Oct-2013 markj

When fetching function arguments out of a frame on amd64, explicitly select
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.

MFC after: 2 weeks


256571 16-Oct-2013 markj

Add a function, memstr, which can be used to convert a buffer of
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:

dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'

Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:

fbt::kern_execve:entry
{
printf("%s", memstr(args[1]->begin_argv, ' ',
args[1]->begin_envv - args[1]->begin_argv));
}

The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.

Tested by: Fabian Keil (earlier version)
MFC after: 2 weeks


256543 15-Oct-2013 jhibbits

Add fasttrap for PowerPC. This is the last piece of the dtrace/ppc puzzle.
It's incomplete, it doesn't contain full instruction emulation, but it should be
sufficient for most cases.

MFC after: 1 month


256259 10-Oct-2013 avg

MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free()

illumos change 14172:be36a38bac3d:
illumos ZFS issues:
4082 zfs receive gets EFBIG from dmu_tx_hold_free()

Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.

PR: kern/182570
Reported by: Keith White <kwhite@site.uottawa.ca>
Tested by: Keith White <kwhite@site.uottawa.ca>
Approved by: re (glebius)
MFC after: 1 week
X-MFC after: r254753


256148 08-Oct-2013 markj

Initialize and free the DTrace taskqueue in the dtrace module load/unload
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed which the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.

Submitted by: gibbs (original version)
Reviewed by: gibbs
Approved by: re (glebius)
MFC after: 2 weeks


256132 08-Oct-2013 delphij

Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by: Denis Ahrens <denis h3q com>
MFC after: 2 weeks
Approved by: re (gjb)


255753 21-Sep-2013 gibbs

Optimize the block size used on ZFS cache devices as is already done
for data and log devices.

Reported by: Dmitryy Makarov
Submitted by: smh
Reviewed by: gibbs
Approved by: re (delphij)
MFC after: 2 weeks


255750 21-Sep-2013 delphij

MFV r254750:

Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools

MFC after: 1 month
Approved by: re (ZFS blanket)


255748 20-Sep-2013 davide

Fixup cross-device rename checks in ZFS. Add a check for the case
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.

Reported by: avg, pjd
Reviewed by: pjd
Approved by: re (rodrigc)


255437 10-Sep-2013 delphij

MFV r247844 (illumos-gate 13975:ef6409bc370f)

Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by: re (ZFS blanket)


255226 05-Sep-2013 pjd

Add sysctl/tunables for various metaslab variables.


255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


254982 28-Aug-2013 delphij

Previously, both zfs_rename and zfs_link does a check on whether
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD). This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.

The recent change (merged as r254585) on upstream changes the
check of v_vfsp to instead check the znode's z_zfsvfs. On Illumos
this would work because when the vnode comes from lofs, the
VOP_REALVP() would give the right vnode, this is not true on
FreeBSD where our VOP_REALVP is a no-op, and as such tdvp is
not guaranteed to be a ZFS vnode, and will later trigger a
failed assertion when verifying the vnode.

This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.

Reported by: many
Diagnostic work by: avg


254813 24-Aug-2013 markj

Rename the kld_unload event handler to kld_unload_try, and add a new
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.

Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifes the linker
code somewhat.

Reviewed by: jhb


254757 24-Aug-2013 delphij

MFV r254749:

Don't hold dd_lock for long by breaking it when not doing dsl_dir
accounting. It is not necessary to hold the lock while manipulating
the parent's accounting, because there is no interface for userland
to see a consistent picture of both parent and child at the same
time anyway.

Illumos ZFS issues:
4046 dsl_dataset_t ds_dir->dd_lock is highly contended


254753 24-Aug-2013 delphij

MFV r254747:

Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive. This is a regression from FreeBSD r253821.

Illumos ZFS issues:
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive


254744 23-Aug-2013 delphij

MFV r254422:

Illumos DTrace issues:
3089 want ::typedef
3094 libctf should support removing a dynamic type
3095 libctf does not validate arrays correctly
3096 libctf does not validate function types correctly


254714 23-Aug-2013 avg

zfs: do not reject any operations on a pool just because it's a boot pool

Unlike the upstream FreeBSD supports booting to all kinds of pools.

Requested by: many
Tested by: sbruno
MFC after: 12 days


254711 23-Aug-2013 avg

zfs: inline and remove zfs_vnode_lock

It didn't serve any useful purpose, but obscured file and line information
useful for debugging.

MFC after: 5 days
X-MFC with: r254445


254649 22-Aug-2013 kib

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


254627 21-Aug-2013 ken

Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM

This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.

UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE

This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.

UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE

This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.

UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT

This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.

UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN

This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.

The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.

UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY

This flag means that the file may not written or
appended, but its attributes may be changed.

ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.

The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
Windows name: FILE_ATTRIBUTE_ARCHIVE

The UF_ARCHIVED flag means that the file has changed and
needs to be archived. The meaning is same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.

msdosfs and ZFS have special handling for this flag.
i.e. they will set it when the file changes.

sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.

chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.

ls.1: Reference chflags(1) for a list of file flags
and their meanings.

strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.

chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.

Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.

zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.

ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.

msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.

It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.

After discussion with Bruce Evans, change
several things in the msdosfs behavior:

Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.

Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.

Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.

smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.

This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.

We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.

stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.

The definition of UF_HIDDEN is the same as
the MacOS X definition.

Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).

ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.

These new flags are only stored, UFS does
not take any action if the flag is set.

Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)


254608 21-Aug-2013 gibbs

Add kstat entries for ZFS compression statistics.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
Add module lifetime functions to allocate and teardown
state data.

Report:
- Compression attempts.
- Buffers found to be empty.
- Compression calls that are skipped because
the data length is already less than or
equal to the minimum block length.
- Compression attempts that fail to yield a 12.5%
compression ratio.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
Add calls to the zio_compress.c module's init and fini
functions.

Sponosred by: Spectra Logic Corporation
MFC after: 2 weeks


254591 21-Aug-2013 gibbs

Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.

2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.

Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.

Populate the new fields in the vdev_stat_t structure.

Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.

Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.

In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

For each sub-optimally configured leaf vdev, report configured
and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.

Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.


254587 21-Aug-2013 delphij

MFV r254421:

Illumos ZFS issues:
3996 want a libzfs_core API to rollback to latest snapshot


254585 20-Aug-2013 delphij

MFV r254220:

Illumos ZFS issues:
4039 zfs_rename()/zfs_link() needs stronger test for XDEV


254445 17-Aug-2013 pjd

Remove redundant variable.


254309 14-Aug-2013 markj

Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by: jhb
X-MFC with: r254266


254268 13-Aug-2013 markj

FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR: 166927, 166926, 166928
Reported by: davide [1]
Reviewed by: avg, trociny (earlier version)
MFC after: 1 month


254198 11-Aug-2013 rpaulo

fasttrap_fork(): unlock the processes before removing the tracepoints.

In the future, we'll need to come up with new proc_*() functions that accept
locked processes. For now, this prevents postgresql + DTrace from crashing the
system.

MFC after: 1 month


254138 09-Aug-2013 attilio

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


254112 08-Aug-2013 delphij

MFV r254079:

Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting


254077 07-Aug-2013 delphij

MFV r254071:

Fix a regression introduced by fix for Illumos bug #3834. Quote from
Matthew Ahrens on the Illumos issue:

ztest fails this assertion because ztest_dmu_read_write() does
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
dmu_object_set_checksum(os, bigobj,
(enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);

If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object. However, ztest does set_checksum(),
which must modify the dnode. The fix is for ztest to also call

dmu_tx_hold_bonus(tx, bigobj);

so we can account for the dirty data associated with setting the checksum

Illumos ZFS issues:
3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
+ delta <= tx->tx_space_towrite


254074 07-Aug-2013 delphij

MFV r254070:

Merge vendor bugfix for ZFS test suite that triggers false positives.

Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together


254012 06-Aug-2013 delphij

MFV r254011:

This change have no effect to FreeBSD but integrated for
completeness.

Illumos ZFS issues:
348 ZFS should handle DKIOCGMEDIAINFOEXT failure


253993 06-Aug-2013 mav

Block reporting of ZFS features for suspended pools.

Before executing any subcommand, zpool tool fetches pools configuration from
the kernel. Before features support was added, kernel was regenerating that
configuration based on data always present in memory. Unfortunately, pool
features list and activity counters are not such. They are stored in ZAP,
that normally resides in ARC, but under heavy memory pressure may be swapped
out. If pool is suspended at this point, there is no way to recover it back
since any zpool command will stuck.

This change has one predictable flaw: `zpool upgrade` always wish to upgrade
suspended pools, but fortunately it can't do it due to the suspension.


253992 06-Aug-2013 mav

Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition. Otherwise it queues free request as
it was before r252840.


253991 06-Aug-2013 mav

Make `zpool clear` to reopen also reconnected cache and spare devices.
Since `zpool status` reports about such kinds of errors, it is strange
that they are not cleared by `zpool clear`.


253990 06-Aug-2013 mav

Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s). That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.

In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.

Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.


253953 05-Aug-2013 attilio

Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by: EMC / Isilon storage division
Reported by: kib


253939 04-Aug-2013 attilio

The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho


253926 04-Aug-2013 smh

zfs_ioc_rename should not leave the value of zc_name passed in via zc altered
on return.

MFC after: 1 week


253821 30-Jul-2013 delphij

MFV r253783:

Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
3834 incremental replication of 'holey' file systems is slow

MFC after: 2 weeks


253820 30-Jul-2013 delphij

MFV r253782:

To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does it's
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).

Illumos ZFS issues:
3888 zfs recv -F should destroy any snapshots created since the
incremental source

MFC after: 2 weeks


253819 30-Jul-2013 delphij

MFV r253781 + r253871:

Illumos ZFS issues:
3894 zfs should not allow snapshot of inconsistent dataset

MFC after: 2 weeks


253816 30-Jul-2013 delphij

MFV r253780:

To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
3875 panic in zfs_root() after failed rollback

MFC after: 2 weeks


253806 30-Jul-2013 mav

Allow three IOCTLs to be used on suspended pool, restoring state that
existed before IOCTL code refactoring merged change 4445fffb from illumos
at r248571.

This change allows `zpool clear` to be used again to recover suspended pool.
It seems the only was supposed by the code to restore pool operation after
reconnecting lost disks that were required for data completeness. There
are still cases where `zpool clear` command can just safely stuck due to
deadlocks inside ZFS kernel part, but probably that is better then having
no chances to recover at all.


253754 28-Jul-2013 mav

Partially close race between calls of orphan() method from GEOM and close()
method from ZFS core, that reliably causes use-after-free panic if SSD vdev
detached during inititial erase.


253643 25-Jul-2013 mav

Following r222950, revert unintentional change cls -> class in argument name
in r245264. Aside from non-uniformity, that again confused C++ compilers.


253606 24-Jul-2013 avg

zfs module: perform cleanup during shutdown in addition to module unload

- move init and fini code into separate functions (like it is done upstream)
- invoke fini code via shutdown_post_sync event hook

This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.

MFC after: 20 days


253603 24-Jul-2013 avg

zfs: move vnode creation from zfs_znode_cache_constructor to zfs_znode_alloc

All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs

This chnage ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops

The change also allows to make zfs_znode_cache_constructor a normal
kmem_cache constructor again (as it is in upstream).
This allows to avoid a problem where zfs_znode_cache_destructor
may be called on un-constructed znodes.

MFC after: 17 days


253441 18-Jul-2013 delphij

Manually merge part of vendor import r238583 from Illumos.

Illumos changeset: 13680:2bd022a765e2
Illumos ZFS issue:

2671 zpool import should not fail if vdev ashift has increased

MFC after: 3 days


253079 09-Jul-2013 avg

dtrace/fasttrap: install hook functions only after all data is
initialized

Sponsored by: HybridCluster
MFC after: 7 days


253073 09-Jul-2013 avg

zfs: try to properly handle i/o errors in mappedread_sf

Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller. The checks are picked up from
kern_sendfile.

MFC after: 3 weeks


253070 09-Jul-2013 avg

zfs: load zpool.cache after a root fs is mounted

MFC after: 3 weeks


252850 05-Jul-2013 markj

Hide references to mod_lock. In FreeBSD it is always acquired with the
provider lock held, so its use has no effect.


252840 05-Jul-2013 mm

MFV r252839:

Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).

By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.

Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case

MFC after: 1 week


252493 01-Jul-2013 markj

Be sure to destory the fasttrap cleanup mutex when unloading the fasttrap
module. This should be MFCed with r250953.


252431 30-Jun-2013 rmh

Enable kernel-specific code for FreeBSD also on other systems that use
the kernel of FreeBSD.

Reviewed by: pjd


252390 29-Jun-2013 smh

Remove invalid ASSERT which causes a panic on zfs renames when run with ASSERTS.
Removal was missed in merge of illumos 3464 (r248571)

MFC after: 2 days


252380 29-Jun-2013 mm

Unbreak "zfs jail" and "zfs unjail" (broken since r248571)

I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's
with the new zfs_ioctl_register_legacy() function.

These operations do not modify pools or datasets so there is no need to
log them to pool history.

Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after: 3 days


252337 28-Jun-2013 gavin

Don't try to re-insert an already present but invalid page.

This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range

This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel

Submitted by: avg
MFC after: 1 week


252219 25-Jun-2013 delphij

MFV r252215:

Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).

Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl

MFC after: 1 week


252060 21-Jun-2013 smh

Fix intermittent ZFS lock panic when kernel is compiled with debugging caused
by access of uninitialized smlock in mmutex_init.

MFC after: 1 week


252056 21-Jun-2013 smh

Fixed import of destroyed ZFS pools failing due to vdev_geom incorrectly
preventing config loads from devices associated with destroyed pools.

Reviewed by: avg
MFC after: 1 week


251646 12-Jun-2013 delphij

MFV r251644:

Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)

Illumos ZFS issues:
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing

MFC after: 2 weeks


251636 11-Jun-2013 delphij

MFV r251626:

ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems

MFC after: 2 weeks


251635 11-Jun-2013 delphij

MFV r251624:

txg commit callbacks don't work

Illumos ZFS issues:
3747 txg commit callbacks don't work

MFC after: 2 weeks


251633 11-Jun-2013 delphij

MFV r251622:

ZFS shouldn't ignore errors unmounting snapshots

Illumos ZFS issues:
3744 zfs shouldn't ignore errors unmounting snapshots

MFC after: 2 weeks


251632 11-Jun-2013 delphij

MFV r251621:

ZFS needs a refcount audit

Illumos ZFS issues:
3741 zfs needs a refcount audit

MFC after: 2 weeks


251631 11-Jun-2013 delphij

MFV r251620:

ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
3741 zfs comments need cleaner, more consistent style

MFC after: 2 weeks


251629 11-Jun-2013 delphij

MFV r251619:

ZFS needs better comments.

Illumos ZFS issues:
3741 zfs needs better comments

MFC after: 2 weeks


251520 08-Jun-2013 delphij

MFV r251519:

* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

ZFS should proactively evict freed blocks from the cache.

Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:

1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

MFC after: 2 weeks


251478 06-Jun-2013 delphij

MFV r251474:

* Illumos zfs issue #3137 L2ARC compression

Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.

The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.

MFC after: 2 weeks


250953 24-May-2013 markj

The fasttrap provider cleans up probes asynchronously when a process with
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.

This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.

Reviewed by: rpaulo
MFC after: 1 month


250574 12-May-2013 markj

Bring back part of r249367 by adding DTrace's temporal option, which allows
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line

#pragma D option temporal

to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).

This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.

The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.

This change corresponds to part of illumos-gate commit e5803b76927480:
3021 option for time-ordered output from dtrace(1M)

Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month


250149 01-May-2013 davide

In case ZFS doesn't use UMA for buffers there's no need to waste memory
creating zones that will remain empty.

Reviewed by: pjd


249921 26-Apr-2013 smh

Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled
Enabled ZFS TRIM by default

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


249858 24-Apr-2013 mm

MFV r249857:

Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.

Illumos ZFS issues:
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock

MFC after: 1 week


249787 23-Apr-2013 mm

The zfs synctask code restructuring introduced a new bug that makes it
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.

Illumos ZFS issues:
3739 cannot set zfs quota or reservation on pool version < 22

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reported by: Steve Wills <swills@FreeBSD.org>
MFC after: 3 days


249573 17-Apr-2013 pfg

DTrace: Revert r249367

The following change from illumos brought caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.

Reference:
https://www.illumos.org/issues/3026

Reported by: Navdeep Parhar and Mark Johnston


249367 11-Apr-2013 pfg

DTrace: option for time-ordered output

Merge changes from illumos:

3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/

This change bumps the DT_VERS_* number to 1.9.1 in
accordance to what is done in illumos.

This change was somewhat complicated because upstream is mixed many
changes in an individual commit and some of the tests don't really
apply to us.

There are also appear to be differences in timestamping with Solaris
so we had to workaround some assertions making sure no regression
happened.

Special thanks to Fabian Keil for changes and testing.

Illumos Revisions: 13758:23432da34147

Reference:
https://www.illumos.org/issues/3021
https://www.illumos.org/issues/3022
https://www.illumos.org/issues/3023
https://www.illumos.org/issues/3024
https://www.illumos.org/issues/3025
https://www.illumos.org/issues/1694

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 months


249356 11-Apr-2013 mm

MFV r249354:
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.

Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream

MFC after: 8 days


249326 10-Apr-2013 mm

Cast to (void *)(uintptr_t) on copyout and copyin of zfs_iocparm_t.zfs_cmd

MFC after: 9 days


249319 09-Apr-2013 mm

ZFS expects a copyout of zfs_cmd_t on an ioctl error. Our sys_ioctl()
doesn't copyout in this case.

To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)

The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
what makes porting of new changes easier and ensures correct behavior if
returning an error.

MFC after: 10 days


249209 06-Apr-2013 mm

MFV r249186:
Do not list read-only pools in zpool.cache
Reduce diff against vendor in unused vdev_disk.c

Illumos ZFS issues:
3639 zpool.cache should skip over readonly pools
3640 want automatic devid updates

MFC after: 1 week


249206 06-Apr-2013 mm

MFV r248660:
Merge vendor change - modify time processing in deadman thread.

Illumos ZFS issues:
3618 ::zio dcmd does not show timestamp data

MFC after: 3 weeks


249196 06-Apr-2013 mm

Provide a fix for kernel panic if receiving recursive deduplicated streams.
Problem reported to vendor.

Illumos ZFS issues:
3692 Panic on zfs receive of a recursive deduplicated stream

MFC after: 2 weeks


249195 06-Apr-2013 mm

MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
3598 want to dtrace when errors are generated in zfs

MFC after: 3 weeks


249188 06-Apr-2013 mm

MFV r242816:
Import vendor change to reduce diff, no effect on FreeBSD.

Illumos ZFS issues:
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd


249047 03-Apr-2013 avg

spa_open_common: fix argument to zvol_create_minors

Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.

MFC after: 7 days


249042 03-Apr-2013 mm

Fix possible pool hold leak in dmu_send_impl()

Problem reported to vendor:
https://www.illumos.org/issues/3645

Reported by: Andriy Gapon <avg@FreeBSD.org>
MFC after: 15 days


249004 02-Apr-2013 mm

Do not check against uninitialized rc and comment out vendor code

MFC after: 16 days


248983 01-Apr-2013 pfg

Dtrace: enablings on defunct providers prevent providers from unregistering

Merge change from illumos:

1368 enablings on defunct providers prevent providers from unregistering

We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.

Illumos Revision: r13430:8e6add739e38

Reference:
https://www.illumos.org/issues/1368

Reviewed by: gnn
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 3 weeks


248976 01-Apr-2013 mm

Call dmu_snapshot_list_next() in zvol.c with dsl_pool_config lock held

Submitted by: Andriy Gapon <avg@FreeBSD.org>
MFC after: 17 days


248752 26-Mar-2013 pfg

Dtrace: dtrace.c erroneously checks for memory alignment on amd64.

Merge change from illumos:

3511 dtrace.c erroneously checks for memory alignment on amd64

Illumos Revision: c93cc65

Reference:
https://www.illumos.org/issues/3511

Obtained from: Illumos
MFC after: 3 weeks


248708 25-Mar-2013 pfg

Dtrace: Add SUN MDB-like type-aware print() action.

Merge change from illumos:

1694 Add type-aware print() action

This is a very nice feature implemented in upstream Dtrace.
A complete description is available here:
http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/

This change bumps the DT_VERS_* number to 1.9.0 in
accordance to what is done in illumos.

While here also include some minor cleanups to ease further merging
and appease clang with a fix by Fabian Keil.

Illumos Revisions: 13501:c3a7090dbc16
13483:f413e6c5d297

Reference:
https://www.illumos.org/issues/1560
https://www.illumos.org/issues/1694

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248706 25-Mar-2013 pfg

Dtrace: add toupper()/tolower() and enhancements to lltostr().

Merge changes from illumos:

1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base

This change bumps the DT_VERS_* number to 1.8.1 in
accordance to what is done in illumos.

The test suite we currently include is outdated and
doesnt support some updates in tst.subr.d which had to
be left out for now.

Illumos Revisions: r13458 5e394d8db762
r13459 c3454574dd1a

Reference:
https://www.illumos.org/issues/1451
https://www.illumos.org/issues/1457

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248690 24-Mar-2013 pfg

Dtrace: add optional size argument to tracemem().

Merge change from illumos:

1455 DTrace tracemem() should take an optional size argument

Our local enhancements to dt_print_bytes were equivalent to
those in illumos but we made it match the illumos version
to ease further code merges.

For now leave out tst.smallsize.d and tst.smallsize.d.out
since those don't seem to work cleanly on FreeBSD.

This change bumps the DT_VERS_* number to 1.7.1 in accordance
to what is done in illumos.

Illumos Revision: 13457:571b0355c2e3

Reference:
https://www.illumos.org/issues/1455

Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month


248653 23-Mar-2013 will

ZFS: Fix a panic while unmounting a busy filesystem.

This particular scenario was easily reproduced using a NFS export. When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed. The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.

Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes. Simply #ifdef sun this check.

Submitted by: avg
Reviewed by: avg
Approved by: ken (mentor)
MFC after: 1 month


248602 21-Mar-2013 smh

Fix for building libzpool under i386.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248579 21-Mar-2013 smh

Add missing descriptions for ZFS sysctls

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248577 21-Mar-2013 smh

Optimisation of TRIM processing.

Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSD's are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIM's where processed.

In addition due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.

This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)

Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248576 21-Mar-2013 smh

Names the ZFS TRIM thread

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks


248575 21-Mar-2013 smh

TRIM cache devices based on time instead of TXGs.
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.

Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.

This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 second.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/17122c31ac7f82875e837019205c21651c05f8cd
MFC after: 2 weeks


248574 21-Mar-2013 smh

Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

- Free ZIOs are registered with the TXG from the ZIO itself, not the
current SPA syncing TXG (which may be out of date);
- L2ARC are registered with a zero TXG number, as L2ARC has no concept
of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced
TXG, not the currently syncing TXG. Indeed, under extremely unlikely
race conditions, there is a risk we could trim blocks which have been
freed in a TXG that has not finished syncing, resulting in potential
data corruption in case of a crash.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/5b46ad40d9081d75505d6f3bf04ac652445df366
MFC after: 2 weeks


248573 21-Mar-2013 smh

Don't register repair writes in the trim map.

The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: Source: https://github.com/dechamps/zfs/commit/6a3cebaf7c5fcc92007280b5d403c15d0e61dfe3


248572 21-Mar-2013 smh

Add TRIM support for L2ARC.

This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc
MFC after: 2 weeks


248571 21-Mar-2013 mm

Merge libzfs_core branch:
includes MFV 238590, 238592, 247580

MFV 238590, 238592:
In the first zfs ioctl restructuring phase, the libzfs_core library was
introduced. It is a new thin library that wraps around kernel ioctl's.
The idea is to provide a forward-compatible way of dealing with new
features. Arguments are passed in nvlists and not random zfs_cmd fields,
new-style ioctls are logged to pool history using a new method of
history logging.

http://blog.delphix.com/matt/2012/01/17/the-future-of-libzfs/

MFV 247580 [1]:
To address issues of several deadlocks and race conditions the locking
code around dsl_dataset was rewritten and the interface to synctasks
was changed.

User-Visible Changes:
"zfs snapshot" can create more arbitrary snapshots at once (atomically)
"zfs destroy" destroys multiple snapshots at once
"zfs recv" has improved performance

Backward Compatibility:
I have extended the compatibility layer to support full backward
compatibility by remapping or rewriting the responsible ioctl arguments.
Old utilities are fully supported by the new kernel module.

Forward Compatibility:
New utilities work with old kernels with the following restrictions:
- creating, destroying, holding and releasing of multiple snapshots
at once is not supported, this includes recursive (-r) commands

Illumos ZFS issues:
2882 implement libzfs_core
2900 "zfs snapshot" should be able to create multiple,
arbitrary snapshots at once
3464 zfs synctask code needs restructuring

References:
https://www.illumos.org/issues/2882
https://www.illumos.org/issues/2900
https://www.illumos.org/issues/3464 [1]

MFC after: 1 month
Sponsored by: Hybrid Logic Inc. [1]


248493 19-Mar-2013 mm

Plug memory leak in dsl_check_snap_cb()
This was unnoticed because the function is very rarely used.

MFC after: 3 days


248457 18-Mar-2013 jhibbits

Add FBT for PowerPC DTrace. Also, clean up the DTrace assembly code,
much of which is not necessary for PowerPC.

The FBT module can likely be factored into 3 separate files: common,
intel, and powerpc, rather than duplicating most of the code between
the x86 and PowerPC flavors.

All DTrace modules for PowerPC will be MFC'd together once Fasttrap is
completed.


248426 17-Mar-2013 mm

Fix typo in sysctl description

Reported by: Jeremy Chadwick
MFC after: 3 days


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


247852 05-Mar-2013 mm

MFV r247845:
Import ZFS bpobj bugfix from vendor.

Illumos ZFS issues:
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely

References:
https://www.illumos.org/issues/3603
https://www.illumos.org/issues/3604

MFC after: 1 week


247820 04-Mar-2013 gibbs

Fix assertion failure when using userland DTrace probes from
the pid provider on a kernel compiled with INVARIANTS.

sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
In fasttrap_probe_pid(), attempts to write to the
address space of the thread that fired the probe
must be performed with the process of the thread
held. Use _PHOLD() to ensure this is the case.

In fasttrap_probe_pid(), use proc_write_regs() instead
of calling set_regs() directly. proc_write_regs()
performs invariant checks to verify the calling
environment of set_regs(). PROC_LOCK()/UNLOCK() around
the call to proc_write_regs() so that it's invariants
are satisfied.

Sponsored by: Spectra Logic Corporation
Reviewed by: gnn, rpaulo
MFC after: 1 week


247602 02-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


247592 01-Mar-2013 delphij

MFV r247575:

Import a fix tighten assertion on SPA versions from vendor (Illumos).

Illumos ZFS issue:

3543 Feature flags causes assertion in spa.c to miss certain cases

MFC after: 2 weeks


247585 01-Mar-2013 mm

MFV r247316:
Merge new read-only zfs properties from vendor (illumos)

Illumos ZFS issues:
3588 provide zfs properties for logical (uncompressed) space used and
referenced

References:
https://www.illumos.org/issues/3588

MFC after: 2 weeks


247540 01-Mar-2013 mm

Fix the zfs_ioctl compat layer to support zfs_cmd size change introduced
in r247265 (ZFS deadman thread). Both new utilities now support the old
kernel and new kernel properly detects old utilities.

For future backwards compatibility, the vfs.zfs.version.ioctl read-only
sysctl has been introduced. With this sysctl zfs utilities will be able
to detect the ioctl interface version of the currently loaded zfs module.

As a side effect, the zfs utilities between r247265 and this revision don't
support the old kernel module. If you are using HEAD newer or equal than
r247265, install the new kernel module (or whole kernel) first.

MFC after: 10 days


247398 27-Feb-2013 mm

MFV 247176, 247178, 247315:
Import metaslab_sync() speedup from vendor (illumos).

Illumos ZFS issues:
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not
condensing)
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()

References:
https://www.illumos.org/issues/3552
https://www.illumos.org/issues/3564
https://www.illumos.org/issues/3578
https://www.illumos.org/issues/3579

MFC after: 2 weeks


247348 26-Feb-2013 mm

Be more verbose on ZFS deadman I/O panic
Patch suggested upstream.

Suggested by: Olivier Cinquin
MFC after: 12 days


247265 25-Feb-2013 mm

MFV v242732:

Merge the ZFS I/O deadman thread from vendor (illumos).
This feature panics the system on hanging ZFS I/O, helps debugging
and resumes failed service.

The panic behavior can be controlled with the loader-only tunables:
vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O)
vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)

By default, ZFS I/O deadman is enabled by default on amd64 and i386
excluding virtual guest machines.

Illumos ZFS issues:
3246 ZFS I/O deadman thread

References:
https://www.illumos.org/issues/3246

MFC after: 2 weeks


247187 23-Feb-2013 mm

MFV r246653:
Import vendor change to avoid "unitialized variable" warnings.

Illumos ZFS issues:
3522 zfs module should not allow uninitialized variables

References:
https://www.illumos.org/issues/3522


247049 20-Feb-2013 gibbs

Avoid panic when tearing down the DTrace pid provider for a
process that has crashed.

sys/cddl/contrib/opensolaris/uts/common/dtrace/fasttrap.c:
In fasttrap_pid_disable(), we cannot PHOLD the proc
structure for a process that no longer exists, but
we still have other, fasttrap specific, state that
must be cleaned up for probes that existed in the
dead process. Instead of returning early if the
process related to our probes isn't found,
conditionalize the locking and carry on with a NULL
proc pointer. The rest of the fasttrap code already
understands that a NULL proc is possible and does
the right things in this case.

Sponsored by: Spectra Logic Corporation
Reviewed by: rpaulo, gnn
MFC after: 1 week


246808 14-Feb-2013 delphij

Eliminate real_LZ4_uncompress. It's unused and does not perform sufficient
check against input stream (i.e. it could read beyond specified input
buffer).


246773 13-Feb-2013 mm

Change vfs.zfs.write_to_degraded from CTLFLAG_RW to CTLFLAG_RWTUN

Suggested by: pjd


246768 13-Feb-2013 delphij

Restore De Bruijn algorithm for sparc64 where the compiler rely on a
library function for __builtin_c?z.

Tested by: Michael Moll <kvedulv kvedulv de>


246688 11-Feb-2013 mm

Merge zfs_ioctl.c code that should have been merged together with ZFS v28.
Fixes several problems if working with read-only pools.

Changed code originaly introduced in onnv-gate 13061:bda0decf867b
Contains changes up to illumos-gate 13700:4bc0783f6064

PR: kern/175897
Suggested by: avg

MFC after: 2 weeks


246678 11-Feb-2013 mm

MFV r246633:
Import vendor bugfixes regarding SA rounding, header size and layout.
This was already partially fixed by avg.

Illumos ZFS issues:
3512 rounding discrepancy in sa_find_sizes()
3513 mismatch between SA header size and layout

References:
https://www.illumos.org/issues/3512
https://www.illumos.org/issues/3513

MFC after: 2 weeks


246675 11-Feb-2013 mm

MFV r246394:
Add tunable to allow block allocation on degraded vdevs.

Illumos ZFS issues:
3507 Tunable to allow block allocation even on degraded vdevs

References:
https://www.illumos.org/issues/3507

MFC after: 2 weeks


246666 11-Feb-2013 mm

MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
https://www.illumos.org/issues/3498

MFC after: 2 weeks


246651 11-Feb-2013 mm

MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after: 2 weeks


246631 10-Feb-2013 mm

MFV r246388:

Import vendor bugfixes

Illumos ZFS issues:
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

References:
https://www.illumos.org/issues/3422
https://www.illumos.org/issues/3425

MFC after: 2 weeks


246586 09-Feb-2013 delphij

MFV r245512:

* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible. This means once a pool have LZ4 compressed data, these
data can no longer be read by older ZFS implementations.

Local changes:

- On-stack hash table disabled and using kernel slab allocator
instead, at this time. This requires larger kernel thread stack
for zio workers. This may change in the future should we adjusted
the zio workers' thread stack size.
- likely and unlikely will be undefined if they are already defined,
this is required for i386 XEN build.
- Removed De Bruijn sequence based __builtin_ctz family of builtins
in favor of the latter. Both GCC and clang supports these builtins.
- Changed the way the LZ4 code detects endianness.
- Manual pages modifications to mention the feature based on Illumos
counterpart.
- Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from: Illumos (13921:9d721847e469)
Tested on: FreeBSD/amd64
MFC after: 1 month


246532 08-Feb-2013 avg

zfs_vget, zfs_fhtovp: properly handle the z_shares_dir object

A special gfs vnode corresponds to that object.
A regular zfs vnode must not be returned.

This should be upstreamed.

Reported by: pluknet
Submitted by: rmacklem
Tested by: pluknet
MFC after: 10 days


246531 08-Feb-2013 avg

zfs: update comments about zfid_long_t to match the FreeBSD definitions

MFC after: 1 week


246293 03-Feb-2013 avg

zfs: fix, improve and re-organize page_lookup and page_unlock

Now they are split into two pairs: page_hold/page_unhold for mappedread
and page_busy/page_unbusy for update_pages.

For mappedread we simply hold a page that is to be used as a source if it
is resident and valid (and not busy). This is sufficient since we are
only doing page -> user buffer copying. There is no page <-> backing
storage I/O involved.

update_pages is now better split to properly handle the putpages case
(page -> arc) and the regular write case (arc -> page).

For the latter we use complete protocol of marking an object with
paging-in-progress and marking a page with io_start (busy count).
Also, in this case we remove the write bit from all page mappings and
clear dirty bits of the pages, the former is needed to ensure that the
latter does the right thing.
Additionally we update a page if it is cached instead of just freeing it
as was done before. This needs to be verified.

A minor detail: ZFS-backed pages should always be either fully valid
or fully invalid. Assert this and use simpler API that does not deal
with sub-page blocks.

Reviewed by: kib
MFC after: 26 days


246242 02-Feb-2013 avg

zfs: add MODULE_VERSION for zfsctrl

This should allow the kernel linker to easily detect a situation
when the module is present both in a kernel and in a preloaded file
(zfs.ko).

Reviewed by: jhb
MFC after: 5 days


245945 26-Jan-2013 avg

spa_generate_rootconf: add support for old vdev labels

It seems that old ZFS versions (v15) completely omit "vdev_children"
property when there is a single child.

Reported by: jase
Tested by: jase
MFC after: 1 week


245511 16-Jan-2013 delphij

MFV r245510:

improve the comment in txg.c

Obtained from: Illumos (13910:f3454e0a097c)
MFC after: 2 weeks


245409 14-Jan-2013 kib

For zfs vnodes, use the standard inode number based hash algorithm.

Reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


245264 10-Jan-2013 delphij

The current ZFS code expects ddt_zap_count to always succeed by asserting
the underlying zap_count() to return no errors. However, it is possible
that the pool reaches to such a state where zap_count would return error,
leading to panics when a pool is imported.

This commit changes the ddt_zap_count to return error returned from
zap_count and handle the error appropriately. With this change, it's now
possible to let zpool rollback damaged transaction groups and import the
pool.

Obtained from: ZFS on Linux github (e8fd45a0f975c6b8ae8cd644714fc21f14fac2bf)
MFC after: 1 month


244635 23-Dec-2012 avg

zfs: solaris doesn't have KM_ZERO, kmem_zalloc should be used instead

To do: remove KM_ZERO declaration
Pointyhat to: avg (for mindlessly using the pseudo-flag)
MFC after: instantly (to fix stable/8 build)


244188 13-Dec-2012 smh

Added vfs.zfs.vdev.trim_on_init sysctl which allows full vdev trim on
initialisation to be enabled (1) / disabled (0) defaults to enabled.

This is useful for devices which have a slow trim speed and are either
new or have otherwise already been wiped e.g. secure erase.

PR: kern/173116
Submitted by: Steven Hartland
Approved by: pjd (mentor)


244187 13-Dec-2012 smh

Upgrades trim free request sizes before inserting them into to free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR: kern/173254
Submitted by: Steven Hartland
Approved by: pjd (mentor)


244155 12-Dec-2012 smh

Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR: kern/173113
Submitted by: Steven Hartland
Reviewed by: pjd
Approved by: pjd
MFC after: 2 weeks


243807 03-Dec-2012 delphij

Use SA_ZPL_CRTIME instead of SA_ZPL_CTIME for creation time.

Submitted by: phil.stone at gmx.com
MFC after: 2 weeks


243763 01-Dec-2012 avg

zfs_getpages: make use of vm_page_readahead_finish

Suggested by: kib
MFC after: 5 days


243762 01-Dec-2012 avg

gfs_file_inactive: replace bad code with ugly code

Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR: kern/151111
Reviewed by: kib
MFC after: 13 days


243560 26-Nov-2012 mm

MFV r243395:

Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after: 2 weeks


243525 25-Nov-2012 mm

Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after: 2 weeks


243524 25-Nov-2012 mm

MFV r243013 and r243267:

Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after: 2 weeks


243521 25-Nov-2012 avg

zfs_freebsd_reclaim: remove a stray variable

... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after: 5 weeks
X-MFC with: r243520


243520 25-Nov-2012 avg

zfs: overhaul zfs-vfs glue for vnode life-cycle management

* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model. All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit. This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim. This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start. This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested. Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 6 weeks


243519 25-Nov-2012 avg

zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here

MFC after: 12 days


243518 25-Nov-2012 avg

add zfs_bmap to aid vnode_pager_haspage

... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block. Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk. The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed. vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem. So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster code vfs_bio code should not be called for ZFS, because ZFS does
not use buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole, it has its prviate mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by: kib
MFC after: 6 weeks


243517 25-Nov-2012 avg

zfs_getpages: optimize for large block sizes

MFC after: 6 weeks


243505 25-Nov-2012 mm

MFV r243012:

Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version

References:
https://www.illumos.org/issues/3349

MFC after: 1 week


243503 25-Nov-2012 mm

MFV r242735:

Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after: 2 weeks


243502 24-Nov-2012 avg

zfs roopool: add support for multi-vdev configurations

Tested by: madpilot
MFC after: 10 days


243501 24-Nov-2012 avg

spa_import_rootpool: initialize ub_version before calling spa_config_parse

... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by: Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by: Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after: 6 days


243500 24-Nov-2012 avg

spa_import_rootpool: do not call spa_history_log_version

The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after: 6 days


243497 24-Nov-2012 avg

zfs: create devices/geoms from zvols after receiveing them

PR: kern/167066
Tested by: Andreas Nilsson <andrnils@gmail.com>
MFC after: 13 days


243270 19-Nov-2012 avg

zfs_remove: assert that delete_now case is never true on FreeBSD

That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with: pjd
MFC after: 12 days


243268 19-Nov-2012 avg

zfs_remove: set VV_NOSYNC flag if a node is unlinked

Suggested by: kib
MFC after: 12 days


243213 18-Nov-2012 avg

spa_import_rootpool: fall back to use configuration from zpool.cache...

if we fail to generate a proper root pool configuration based on disk
probing. Currently we can not properly generate the configuration for
multi-vdev pools. Make that explicit.

Reported by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after: 4 days


242958 13-Nov-2012 kib

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


242862 10-Nov-2012 avg

zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots

... before trying to destroy the zvol snapshots themselves.

PR: kern/173442
Reported by: Petri Helenius <petri@helenius.fi>,
mm
Obtained from: Brian Behlendorf <behlendorf1@llnl.gov>,
Illumos Bug #3170
Tested by: Petri Helenius <petri@helenius.fi>
MFC after: 10 days


242845 10-Nov-2012 delphij

MFV r242729 (mm):

Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable


242833 09-Nov-2012 attilio

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


242723 07-Nov-2012 jhibbits

Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.

There is one known issue: Some probes will display an error message along the
lines of: "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit. I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded. Volunteers are welcome.

MFC after: 1 month


242575 04-Nov-2012 avg

zfs_dirlook: bailout early if directory is unlinked

Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with: pjd
Also see: http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after: 26 days


242574 04-Nov-2012 avg

zfsctl_snapdir_lookup: obtain a snapname in the remount case

... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
MFC after: 20 days


242573 04-Nov-2012 avg

zfs: set MNTK_EXTENDED_SHARED flag

Discussed with: kib
MFC after: 20 days


242571 04-Nov-2012 avg

zfs_vnode_forget: dispose of larvae vnode using public vfs api (mostly)

Reviewed by: kib
MFC after: 19 days


242570 04-Nov-2012 avg

zfs_umount: no need to set MNTK_UNMOUNTF here, dounmount handles that

Reviewed by: kib
MFC after: 19 days


242568 04-Nov-2012 avg

zfs_vnode_lock: no need to double-guess caller's intentions here

vn_lock should do the right thing with respect to given vnode lock
flags. If a caller doesn't mind a doomed vnode, then zfs should deliver.

Reviewed by: kib
MFC after: 19 days


242567 04-Nov-2012 avg

zfs_mount: drop vfs.zfs.rootpool.prefer_cached_config tunable

It turned out to be not that useful, because its default value may lead
to a problem when a root pool is present in zpool.cache, but its
on-disk status is 'exported'. This may happen if the pool was imported
in a different environment with -f flag and then exported.

MFC after: 12 days


242566 04-Nov-2012 avg

zfs_freebsd_close: call zfs_close with count=1 instead of count=0

Otherwise we may be leaking z_sync_cnt, which may lead to unnecessary
ZIL sync-ing.

MFC after: 12 days


242332 30-Oct-2012 delphij

s/dettach/detach/g

Approved by: pjd
MFC after: 1 month


242135 26-Oct-2012 avg

zfs: fix label validation code in vdev_geom_read_config

POOL_STATE_SPARE and POOL_STATE_L2CACHE were not handled correctly
and thus the cache and spare disks would not be correctly probed.

Reported by: Michael Schmiedgen <schmiedgen@gmx.net>,
Matthew D. Fuller <fullermd@over-yonder.net>
Tested by: Michael Schmiedgen <schmiedgen@gmx.net>,
flo
MFC after: 5 days


241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


241773 20-Oct-2012 avg

zfs: wait in arc_lowmem only if curproc == pageproc

... otherwise the current thread might be holding ARC locks and thus run
into a deadlock. This happens, for example, when a thread does memory
allocation in the ARC code and runs into KVA shortage.
Also, it really makes the most sense to wait in pageproc, so that the
results of ARC reclamation are seen before the page cache is acted upon.
In other cases where vm_lowmem is invoked, e.g. on KVA space shortage,
the callers perform multiple attempts (up to 8) and wait for rather
long intervals between them (up to 4 seconds), so ARC reclaim results
should become visible even without explicit waiting on the ARC thread.

Note that this is not a critical issue for typical ZFS usages where KVA
space should already be large enough. On amd64 systems setting KVA size
to twice the physical memory size is known to mitigate KVA fragmentation
issues in practice.

Side note: perhaps vm_lowmem 'how' parameter should be used to
differentiate between causes of the event.

Reported by: Nikolay Denev <ndenev@gmail.com>
MFC after: 19 days


241628 17-Oct-2012 avg

zfs: make use of getnewvnode_reserve in zfs_mknode and zfs_zget

getnewvnode_reserve helps to avoid "recursing" back into zfs code
via getnewvnode when that latter needs to reclaim some vnodes.
zfs code may hold a number of locks around getnewvnode and doesn't
expect any recursion to happen on those locks, because that never
happens in solaris.

I believe that this change also eleiminates a need for the delayed
znode destruction via the taskqueue.

Many thanks to kib for devising getnewvnode_reserve.

Reported by: flo
Tested by: bapt, kwm, swills
MFC after: 2 weeks
X-MFC after: r241556


241394 10-Oct-2012 kevlo

Revert previous commit...

Pointyhat to: kevlo (myself)


241370 09-Oct-2012 kevlo

Prefer NULL over 0 for pointers


241297 06-Oct-2012 avg

zvol: set mediasize in geom provider right upon its creation

... instead of deferring the action until first open.
Unlike upstream this has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened. Initial mediasize of zero causes tasting failure
and subsequent retasting because of the size change.

MFC after: 14 days


241286 06-Oct-2012 avg

zfs_mount: taste geom providers for root pool config

This should allow to mount a dataset as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.

If the root filesystem's pool is found in zpool.cache, the by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.

Discussed with: gibbs, pjd, des
MFC after: 25 days


240955 26-Sep-2012 mm

Merge recent vendor changes in ZFS.

Illumos issued covered:
2811 missing implementation: zfs send -r
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting

References:
https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


240870 23-Sep-2012 pjd

It is possible to recursively destroy snapshots even if the snapshot
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:

tank
tank/foo
tank/foo@snap
tank/bar
tank/bar@snap

We can execute:

# zfs destroy -t tank@snap

eventhough tank@snap doesn't exit.

Unfortunately it is not possible to do the same with recursive rename:

# zfs rename -r tank@snap tank@pans
cannot open 'tank@snap': dataset does not exist

...until now. This change allows to recursively rename snapshots even if
snapshot doesn't exist on the starting dataset.

Sponsored by: rsync.net
MFC after: 2 weeks


240868 23-Sep-2012 pjd

Add TRIM support.

The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something will be reordered and write will reached the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by: multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from: zfsonlinux
Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>


240831 22-Sep-2012 avg

zfs: allow a zvol to be used as a pool vdev, again

Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.

Reviewed by: pjd
MFC after: 24 days


240829 22-Sep-2012 pjd

As in r226967, r226987 and r232401 changes to UFS and TMPFS remove cache
entries associated with the source and the target of rename().

MFC after: 1 week


240632 18-Sep-2012 avg

zfs: correctly calculate dn_bonuslen for saving SAs to disk

Since all attribute values start at 8-byte aligned boundary, we would
previously incorrectly calculate dn_bonuslen if any attribute but the
last had a variable-length value with length not multiple of 8.

Reported by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Tested by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> (for upstream)
MFC after: 2 weeks


240631 18-Sep-2012 avg

zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line

Discussed with: pjd
MFC after: 10 days


240415 12-Sep-2012 mm

Merge recent zfs vendor changes, sync code and adjust userland DEBUG.

Illumos issued covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

References:
https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


240345 11-Sep-2012 avg

zfs: fix sa_modify_attrs handling of variable-sized attributes

- skip length_idx index for a replaced variable-sized attribute
- skip length_idx index for a removed variable-sized attribute
- also re-arranged code to make sure that length_idx is always
incremented for variable-sized attributes
- additionally add an assertion that the number of actually produced
attributes is the same as the expected number of resulting
attributes

In cooperation with: Matthew Ahrens <mahrens@delphix.com>
Tested by: Trent Nelson <trent@snakebite.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> (for upstream)
To do: get this upstreamed
MFC after: 2 weeks


240303 10-Sep-2012 mm

Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.

MFC after: 2 weeks


240133 05-Sep-2012 mm

Merge recent vendor changes and sync code:
1862 incremental zfs receive fails for sparse file > 8PB
3112 ztest does not honor ZFS_DEBUG
3122 zfs destroy filesystem should prefetch blocks
3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

References:
https://www.illumos.org/issues/1862
https://www.illumos.org/issues/3112
https://www.illumos.org/issues/3122
https://www.illumos.org/issues/3129
https://www.illumos.org/issues/3130

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239786 28-Aug-2012 ed

Use a proper destructor function.

When calling a revoke(2) on a dtrace device, dtrace_close() could be
called, even if threads are still stuck in the device. Defer the actual
deallocation of datastructures to the cdevpriv destructor.

While there, remove the unneeded D_TRACKCLOSE and D_NEEDMINOR flags. For
the helper device, we never need it. For the regular dtrace devices, we
only need these flags on FreeBSD pre-8.

MFC after: 1 month


239774 28-Aug-2012 mm

Merge recent vendor changes:
3100 zvol rename fails with EBUSY when dirty
3104 eliminate empty bpobjs
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

References:
https://www.illumos.org/issues/3100
https://www.illumos.org/issues/3104
https://www.illumos.org/issues/3120

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks


239620 23-Aug-2012 mm

Merge recent vendor changes:
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Referenes:
https://www.illumos.org/issues/3086
https://www.illumos.org/issues/3090
https://www.illumos.org/issues/3102

PR: kern/170912, kern/170914
Obtained from: illumos (changeset #13776, #13777)
MFC after: 2 weeks


239389 19-Aug-2012 mm

Backport fix for vendor issue #3085
3085 zfs diff panics, then panics in a loop on booting

References:
https://www.illumos.org/issues/3085

PR: kern/170763
Obtained from: ssh://anonhg@hg.illumos.org/illumos-gate (r13772)
MFC after: 1 week


239303 15-Aug-2012 hselasky

Streamline use of cdevpriv and correct some corner cases.

1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, hence for example read, write, ioctl and
so on might be sleeping at the time of "d_close" being called
and then then freed private data can still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)

2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.

3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.

Discussed with: phk
MFC after: 2 weeks


239077 05-Aug-2012 marius

Include <vm/vm_param.h> for PA_LOCK_COUNT in order to fix kernel build
with options ZFS after r239065.


238926 30-Jul-2012 mm

Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags

References:
https://www.illumos.org/issues/2762

MFC after: 2 weeks


238656 20-Jul-2012 trasz

Make ZVOL resizing ('zfs set volsize') properly resize the GEOM provider.

Sponsored by: FreeBSD Foundation


238113 04-Jul-2012 pjd

vdev_io_done stage is not used for ioctls.

MFC after: 1 week


237972 02-Jul-2012 mm

Expose scrub and resilver tunables.
This allows the user to tune the priority trade-off between scrub/resilver
and other ZFS I/O.

MFC after: 2 weeks
Discussed with: pjd


237817 29-Jun-2012 pfg

Bump dtrace_helper_actions_max from 32 to 128

Dave Pacheco from Joyent (and Dtrace.org) bumped the cap to 1024 but,
according to his blog, 128 is the recommended minimum.

For now bump it safely to 128 although we may have to bump it further
if there is demand in the future.

Reference:

http://www.illumos.org/issues/2558
http://dtrace.org/blogs/dap/2012/01/50/where-does-your-node-program-spend-its-time/


237624 27-Jun-2012 pfg

Bring llquantize support into Dtrace.

Bryan Cantrill implemented the equivalent of semi-log graph
paper for Dtrace so llquantize will use one logarithmic and
one linear scale.

Special thanks to Mark Peek for providing fix to an
assertion and to Fabian Keill for testing the port.

Illumos Revision: 13355:15b74a2a9a9d

Reference:
https://www.illumos/issues/905

Obtained from: Illumos
Tested by: Fabian Keill, mp
MFC after: 4 days


237458 22-Jun-2012 mm

Import Illumos revision 13736:9f1d48e1681f
2901 ZFS receive fails for exabyte sparse files

References:
https://www.illumos.org/issues/2901

Obtained from: illumos (issue #2901)
MFC after: 1 week


236884 11-Jun-2012 mm

Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags

References:
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747

Obtained from: illumos (issue #2619, #2747)
MFC after: 1 month


236823 09-Jun-2012 pjd

ds_guid of 0 is special, as it is used by snapshot receive code to
differentiate between an incremental and full stream.
Be sure not to generate guid equal to 0.

Reported by: someone who saw 0 being generated as 64bit random guid
MFC after: 3 days


236250 29-May-2012 pjd

Tighten up the assertion: because size can't be 0 and even if sm_space is equal
to sm_size, any 'sm_space - size' will be less than sm_size.

MFC after: 3 days


236249 29-May-2012 pjd

Eliminate 'where' argument, we don't use it.

MFC after: 3 days


236248 29-May-2012 pjd

Remove unused variable.

MFC after: 3 days


236247 29-May-2012 pjd

Remove unused sysctl.

MFC after: 3 days


236155 27-May-2012 mm

Import illumos changeset 13570:3411fd5f1589
1948 zpool list should show more detailed pool information

Display per-vdev information with "zpool list -v".
The added expandsize property has currently no value on FreeBSD.
This changeset allows adding expansion support to individual vdevs
in the future.

References:
https://www.illumos.org/issues/1948

Obtained from: illumos (issue #1948)
MFC after: 2 weeks


236146 27-May-2012 mm

Import illumos changeset 13605:b5c2b5db80d6 (partial)
763 FMD msg URLs should refer to something visible

Replace sun.com URL's with illumos.org

References:
https://www.illumos.org/issues/763

Obtained from: illumos (issue #763)
MFC after: 1 week


235781 22-May-2012 trasz

Fix enforcement of file size limit with O_APPEND on ZFS.

vn_rlimit_fsize takes uio->uio_offset and uio->uio_resid into account
when determining whether given write would exceed RLIMIT_FSIZE.

When APPEND flag is specified, ZFS updates uio->uio_offset to point to the
end of file.

But this happens after a call to vn_rlimit_fsize, so vn_rlimit_fsize check
can be rendered ineffective by thread that opens some file with O_APPEND
and lseeks below RLIMIT_FSIZE before calling write.

Submitted by: Mateusz Guzik <mjguzik at gmail dot com>
MFC after: 2 weeks


235222 10-May-2012 mm

Import illumos changeset 13686:4bc0783f6064
2703 add mechanism to report ZFS send progress

If the zfs send command is used with the -v flag, the amount of bytes
transmitted is reported in per second updates.

References:
https://www.illumos.org/issues/2703

Obtained from: illumos (issue #2703)
MFC after: 2 weeks


234795 29-Apr-2012 marius

Partially revert r232938; ZFS only requires nfs4 but not posix1e.

Submitted by: jhb


234691 26-Apr-2012 rstone

Implement the D "cpu" variable, which returns curcpu. I have chosen not
to follow the example of OpenSolaris and its descendants, which implemented
cpu as an inline that took a value out of curthread. At certain points in
the FreeBSD scheduler curthread->td_oncpu will no longer be valid (in
particukar, just before the thread gets descheduled) so instead I have
implemented this as its own built-in variable.

Sponsored by: Sandvine Inc.
MFC after: 1 week


234607 23-Apr-2012 trasz

Remove unused thread argument to vrecycle().

Reviewed by: kib


234064 09-Apr-2012 attilio

- Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039


233918 05-Apr-2012 avg

zfs_ioctl: no need for ddi_copyin/out here because sys_ioctl handles that

On FreeBSD the direct ioctl argument is automatically copied in/out
as necesary by the kernel ioctl entry point.

PR: kern/164445
Submitted by: Luis Garces-Erice <lge@ieee.org>
Tested by: Attila Nagy <bra@fsn.hu>
MFC after: 5 days


233408 24-Mar-2012 gonzo

Add MIPS support to cddl/contrib part:

- header and stub .c file for fasttrap module. It's not supported on
MIPS yet, but there is no way to disable support completely
- Do as amd64 trying to limit allocated memory


232938 13-Mar-2012 adrian

Add dependencies onto acl_posix1e and acl_nfs4.


232186 26-Feb-2012 mm

Analogous to r232059, add a parameter for the ZFS file system:

allow.mount.zfs:
allow mounting the zfs filesystem inside a jail

This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers

MFC after: 10 days


230945 03-Feb-2012 mm

Revert r230913 and r230914.

The initialization was correct, the problem needs deeper analysis.


230914 02-Feb-2012 mm

Add copyright information on last commits to comply with CDDL.

Discussed with: pluknet@
MFC after: 3 days


230913 02-Feb-2012 mm

Fix out of bounds write causing random panics,
uncovered by the change in r230256

Reviewed by: pluknet@
MFC after: 3 days


230689 29-Jan-2012 kmacy

always exclude data bufs regardless of debug settings


230647 28-Jan-2012 kmacy

add tunable for developers working on areas outside of ZFS
to further reduce core size by excluding ARC metadata buffers
from core dumps


230623 27-Jan-2012 kmacy

exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create

Reviewed by: alc, avg
MFC after: 2 weeks


230514 24-Jan-2012 mm

Merge illumos revisions 13572, 13573, 13574:

Rev. 13572:
disk sync write perf regression when slog is used post oi_148 [1]

Rev. 13573:
crash during reguid causes stale config [2]
allow and unallow missing from zpool history since removal of pyzfs [5]

Rev. 13574:
leaking a vdev when removing an l2cache device [3]
memory leak when adding a file-based l2arc device [4]
leak in ZFS from metaslab_group_create and zfs_ereport_checksum [6]

References:
https://www.illumos.org/issues/1909 [1]
https://www.illumos.org/issues/1949 [2]
https://www.illumos.org/issues/1951 [3]
https://www.illumos.org/issues/1952 [4]
https://www.illumos.org/issues/1953 [5]
https://www.illumos.org/issues/1954 [6]

Obtained from: illumos (issues #1909, #1949, #1951, #1952, #1953, #1954)
MFC after: 2 weeks


230438 21-Jan-2012 pjd

Dramatically optimize listing snapshots when user requests only snapshot
names and wants to sort them by name, ie. when executes:

# zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot properties.

Below you can find how long does it take to list 34509 snapshots from a single
disk pool before and after this change with cold and warm cache:

before:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s

after:

# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s

MFC after: 1 week


230397 20-Jan-2012 pjd

By default turn off prefetch when listing snapshots.
In my tests it makes listing snapshots 19% faster with cold cache and
47% faster with warm cache.

MFC after: 1 week


230256 17-Jan-2012 pluknet

Fix the "lock &zrl->zr_mtx already initialized" assertion by initializing
the allocated memory before calling mtx_init(9) on mtx pointing to it.
Otherwize, random contents of uninitialized memory might occasionally
trigger the assertion.

Reported by: Pavel Polyakov <bsd kobyla org>
Reviewed by: pjd
MFC after: 1 week


229663 05-Jan-2012 pjd

- Allow to change vfs.zfs.arc_meta_limit at runtime.
- Change vfs.zfs.arc_meta_used from CTLFLAG_RDTUN to CTLFLAG_RD, as it is
not a tunable.

MFC after: 3 days


229425 03-Jan-2012 dim

In sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c, check the
the number of links against LINK_MAX (which is INT16_MAX), not against
UINT32_MAX. Otherwise, the constant would implicitly be converted to
-1.

Reviewed by: pjd
MFC after: 1 week


228686 18-Dec-2011 pjd

From time to time people report space map corruption resulting in panic
(ss == NULL) on pool import. I had such a panic recently. With current version
of ZFS it is still possible to import the pool in readonly mode and backup
all the data, but in case it is impossible for some reason add tunable
vfs.zfs.space_map_last_hope, which when set to '1' will tell ZFS to remove
colliding range and retry. This seems to have worked for me, but I consider
it highly risky to use.

MFC after: 1 week


228685 18-Dec-2011 pjd

Implement replying of ACLs updates. ACL changes should go to ZIL only
if the 'sync' property is set to 'always', so replying them is not common.

MFC after: 1 month


228448 12-Dec-2011 attilio

Revert the approach for skipping lockstat_probe_func call when doing
lock_success/lock_failure, introduced in r228424, by directly skipping
in dtrace_probe.

This mainly helps in avoiding namespace pollution and thus lockstat.h
dependency by systm.h.

As an added bonus, this also helps in MFC case.
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424


228392 10-Dec-2011 pjd

Move ru_inblock increment into arc_read_nolock() so we don't account for
cached reads.

Discussed with: gibbs
No objections from: avg
Tested by: Marcus Reid <marcus@blazingdot.com>
MFC after: 1 week


228363 09-Dec-2011 pjd

The vfs.zfs.txg.timeout sysctl can be safely modified at run time.

MFC after: 1 week


228104 28-Nov-2011 mm

Fix typo in copyright notice.

MFC after: 1 month


228103 28-Nov-2011 mm

Merge new ZFS features from illumos:

1644 add ZFS "clones" property
https://www.illumos.org/issues/1644

1645 add ZFS "written" and "written@..." properties
https://www.illumos.org/issues/1645

1646 "zfs send" should estimate size of stream
https://www.illumos.org/issues/1646

1647 "zfs destroy" should determine space reclaimed by destroying multiple
snapshots
https://www.illumos.org/issues/1647

1693 persistent 'comment' field for a zpool
https://www.illumos.org/issues/1693

1708 adjust size of zpool history data
https://www.illumos.org/issues/1708

1748 desire support for reguid in zfs
https://www.illumos.org/issues/1748

Obtained from: illumos (changesets 13514, 13524, 13525)
MFC after: 1 month


227697 19-Nov-2011 kib

Existing VOP_VPTOCNP() interface has a fatal flow that is critical for
nullfs. The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.

Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.

Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.

Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)


227291 07-Nov-2011 rstone

Replace fasttrap_copyout() with uwrite(). FreeBSD copyout() is not able to
write to the .text section of a process.

Obtained from: rpaulo
MFC after: 3 days


227111 05-Nov-2011 pjd

Correct typo in comment.

Reported by: Fabian Keil <fk@fabiankeil.de>
MFC after: 3 days


227110 05-Nov-2011 pjd

In zvol_open() if the spa_namespace_lock is already held, it means that
ZFS is trying to open and taste ZVOL as its VDEV. This is not supported,
so return an error instead of panicing on spa_namespace_lock recursion.

Reported by: Robert Millan <rmh@debian.org>
PR: kern/162008
MFC after: 3 days


226732 25-Oct-2011 mm

Fix typo in copyright notice introduced in r226724
(missing character in e-mail adress)

Reported by: pjd
MFC after: 3 days


226724 25-Oct-2011 mm

Update copyright information in several ZFS files, as the clause 3.3
of the CDDL licence explicitly requires every Contributor to add
a copyright notice.

This also reflects the copyright notices for the changes recently
added by Illumos.

MFC after: 3 days


226707 24-Oct-2011 pjd

- Use better naming now that we allow to rename any mounted file system (not
only legacy).
- Update copyright to include myself.

MFC after: 2 weeks


226700 24-Oct-2011 pjd

Don't forget to rename mounted snapshots of the file system being renamed.

MFC after: 2 weeks


226678 24-Oct-2011 pjd

Include <sys/zfs_vfsops.h> only when compiling kernel module.

MFC after: 2 weeks


226676 24-Oct-2011 pjd

Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' preperty set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.

This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.

MFC after: 2 weeks


226620 21-Oct-2011 pjd

Update per-thread I/O statistics collection in ZFS.
This allows to see processes I/O activity in 'top -m io' output.

PR kern/156218
Reported by: Marcus Reid <marcus@blazingdot.com>
Patch by: avg
MFC after: 3 days


226617 21-Oct-2011 pjd

zfs vdev_file_io_start: validate vdev before using vdev_tsd

vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

Submitted by: avg
MFC after: 3 days


226512 18-Oct-2011 mm

Import fix for Illumos bug #1475 to reduce diff against upstream.

Panic caused by this bug was already partially fixed by pjd@
in p4 CH 185940 and 185942.

Reference:
1475 zfs spill block hold can access invalid spill blkptr
https://www.illumos.org/issues/1475

Reviewed by: delphij
Obtained from: Illumos (issue 1475, changeset 13469:b8e89e5c4167)
MFC after: 1 week


226483 17-Oct-2011 delphij

Fix a bug in sa_find_sizes() which could lead to panic:

When calculating space needed for SA_BONUS buffers,
hdrsize is always rounded up to next 8-aligned boundary.
However, in two places the round up was done against
sum of 'total' plus hdrsize. On the other hand,
hdrsize increments by 4 each time, which means in
certain conditions, we would end up returning with
will_spill == 0 and (total + hdrsize) larger than
full_space, leading to a failed assertion because
it's invalid for dmu_set_bonus.

Sponsored by: iXsystems, Inc.
Reviewed by: mm
MFC after: 3 days


225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


225418 06-Sep-2011 kib

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


225166 25-Aug-2011 mm

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


225153 24-Aug-2011 pjd

We need to unlock and destroy vnode attached to znode which we are freeing.

Reviewed by: kib
Approved by: re (bz)
MFC after: 1 week


224855 13-Aug-2011 mm

zfs_ioctl.c: improve code readability in zfs_ioc_dataset_list_next()

zvol.c: fix calling of dmu_objset_prefetch() in zvol_create_minors()
by passing full instead of relative dataset name and prefetching all
visible datasets to be processed later instead of just the pool name

Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M opensolaris/uts/common/fs/zfs/zfs_ioctl.c
M opensolaris/uts/common/fs/zfs/zvol.c


224814 13-Aug-2011 mm

Fix race between dmu_objset_prefetch() invoked from
zfs_ioc_dataset_list_next() and dsl_dir_destroy_check() indirectly
invoked from dmu_recv_existing_end() via dsl_dataset_destroy() by not
prefetching temporary clones, as these count as always inconsistent.
In addition, do not prefetch hidden datasets at all as we are not
going to process these later.

Filed as Illumos Bug #1346

PR: kern/157728
Tested by: Borja Marcos <borjam@sarenet.es>, mm
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week


224791 12-Aug-2011 pjd

Eliminate the zfsdev_state_lock entirely and replace it with the
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.

Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week


224605 02-Aug-2011 mm

Fix panic in zfs_read() if IO_SYNC flag supplied by checking for
zfsvfs->z_log before calling zil_commit(). [1]
Do not call zfs_read() from zfs_getextattr() with the IO_SYNC flag.

Submitted by: Alexander Zagrebin <alex@zagrebin.ru> [1]
Reviewed by: pjd@
Approved by: re (kib)
MFC after: 3 days


224579 01-Aug-2011 mm

Fix integer overflow in txg_delay() by initializing
the variable "timeout" as clock_t.

Filed as Illumos Bug #1313

Reviewed by: avg
Approved by: re (kib)
MFC after: 3 days


224526 30-Jul-2011 mm

Fix serious bug in ZIL that can lead to pool corruption
in the case of a held dataset during remount.

Detailed description is available at:
https://www.illumos.org/issues/883

illumos-gate revision: 13380:161b964a0e10

Reviewed by: pjd
Approved by: re (kib)
Obtained from: Illumos (Bug #883)
MFC after: 3 days


224252 21-Jul-2011 delphij

Bring the code more in-line with OpenSolaris source to
ease future port.

Reviewed by: pjd, mm
Approved by: re (kib)


224251 21-Jul-2011 delphij

A different implementation of r224231 proposed by pjd@,
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().

Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)


224231 20-Jul-2011 delphij

Add a new field to in-core znode, z_rdev, to represent device nodes.

PR: kern/159010
Reviewed by: mm@
Approved by: re (kib)
MFC after: 2 weeks


224177 18-Jul-2011 mm

ZFS tries to allocate blocks evenly across all devices. This means when
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.

New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)

Illumos-gate changeset: 13379:4df42cc92254

Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks


224174 18-Jul-2011 mm

Resurrect the ZFS "aclmode" property
Change default of "aclmode" to "discard".

Illumos-gate changeset: 13370:8c04143bd318

Obtained from: Illumos (Feature #742)
MFC after: 2 weeks


223623 28-Jun-2011 mm

Add a new "REFCOMPRESSRATIO" property.

For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).

This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).

Illumos-gate revision: 13387

Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks


223622 28-Jun-2011 mm

Disable vdev cache (readahead) by default.

The vdev cache is very underutilized (hit ratio 30%-70%) and may consume
excessive memory on systems with many vdevs.

Illumos-gate revision: 13346

Obtained from: Illumos (Bug #175)
MFC after: 1 week


223262 18-Jun-2011 benl

Fix clang warnings.

Approved by: philip (mentor)


222950 10-Jun-2011 gibbs

Remove C constructs that are incompatible with C++ from various
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.

Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.

cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.

sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.

Sponsored by: Spectra Logic Corporation


222835 07-Jun-2011 mm

Silence notice on pool creation, import and access.

Suggested by: Jeremy Chadwick (freebsd-stable@)
Discussed with: pjd
MFC after: 1 week


222268 24-May-2011 pjd

Don't pass pointer to name buffer which is on the stack to another thread,
because the stack might be paged out once the other thread tries to use the
data. Instead, just allocate memory.

MFC after: 2 weeks


222267 24-May-2011 pjd

Don't access task structure once we call task function.
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.

Submitted by: anonymous
MFC after: 2 weeks


222199 22-May-2011 rmacklem

Fix the zfs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by: pjd


222167 22-May-2011 rmacklem

Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by: kib


222050 18-May-2011 mm

Restore old (v15) behaviour for a recursive snapshot destroy.
(zfs destroy -r pool/dataset@snapshot)

To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.

Filed as Illumos Bug #1043

Reviewed by: delphij
Approved by: pjd
MFC after: together with v28


221409 03-May-2011 marius

Convert the last use of xcopyout() to ddi_copyout() and remove the now
unused xcopyin() as well as xcopyout().
MFC together with r219089.

Approved by: mm


221263 30-Apr-2011 mm

Fix deduplicated zfs receive
(dmu_recv_stream builds incomplete guid_to_ds_map)

Illumos-gate changeset: 13329:c48b8bf84ab7
MFC together with v28

Approved by: pjd
Obtained from: Illumos (Bug #755)


219973 24-Mar-2011 pjd

Checking file access on size change is bogus. The checks are done earlier by
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.

PR: standards/154873
Reported by: Mark Martinec <Mark.Martinec@ijs.si>
Discussed with: Eric Schrock <eric.schrock@delphix.com>
Discussed with: Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after: 1 month


219636 14-Mar-2011 pjd

Fix potential panic in dbuf_sync_list() relate to spill blocks handling.

Obtained from: IllumOS
MFC after: 1 month


219404 08-Mar-2011 pjd

Correct readdir over ZFS handling.

Reported by: Pierre Beyssac <pb@fasterix.frmug.org>
MFC after: 1 month


219320 06-Mar-2011 pjd

Fix libzpool build.

MFC after: 1 month


219317 05-Mar-2011 pjd

Make renaming of a ZVOL, ZVOL's parent directory and ZVOL snapshot work.

Reported by: avg
MFC after: 1 month


219316 05-Mar-2011 pjd

Simplify zvol_remove_minors() a bit.

MFC after: 1 month


219089 27-Feb-2011 pjd

Finally... Import the latest open-source ZFS version - (SPA) 28.

Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.

MFC after: 1 month


218550 11-Feb-2011 kib

For UIO_NOCOPY case of reading request on zfs vnode, which has vm object
attached, activate the page after the successful read, and free the page
if read was unsuccessfull.

Freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappeared until the vnode
reclaimed.

Reviewed by: avg
MFC after: 1 week


218386 06-Feb-2011 trasz

Make it impossible to clear the MNT_NFS4ACLS flag on ZFS filesystem
by using "mount -uw".

Reviewed by: pjd
MFC after: 2 weeks


218278 04-Feb-2011 ae

vdev's sectorsize should not be greater than 8 Kbytes and also
it should be power of 2. This prevents non-aligned access while
probing vdev's labels.

PR: kern/147852
Reviewed by: pjd
MFC after: 1 week


217588 19-Jan-2011 trasz

Add MNT_NFS4ACLS to ZFS mount flags. It's not conditional, since there
is no way to disable NFSv4 ACLs in ZFS. This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.

Reviewed by: pjd
MFC after: 2 weeks


217367 13-Jan-2011 mdf

Re-commit the zfs sysctl(9) type-safety changes.

Thanks to dim and pjd for the pointer to zfs_context.h for building
userland.


217332 12-Jan-2011 mdf

Revert cddl changes for sysctl(9) until I understand why this isn't
building on universe.


217319 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the zfs piece.


216919 03-Jan-2011 mm

MFp4 r186485, r186859:

Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.

Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days


216378 11-Dec-2010 pjd

Remove redundant semicolon and empty like.


216256 07-Dec-2010 ivoras

Undo r216230: the interaction between saved ashift in metadata and
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).

Noticed by: pjd


216230 06-Dec-2010 ivoras

Use GEOM stripesize field when calculating ashift. This will enable correct
alignment on drives with large sector sizes (e.g. 4 KiB) but the
implementation might need to be revisited if devices with large stripesizes
appear (e.g. if RAID controllers or flash drives start using the field),
probably by introducing a physsectorsize field in GEOM providers.

Discussed with: mav, mostly silence on freebsd-geom@ and freebsd-fs@


215401 16-Nov-2010 avg

zfs+sendfile: populate all requested pages, not just those already cached

kern_sendfile() uses vm_rdwr() to read-ahead blocks of data to populate
page cache. When sendfile stumbles upon a page that is not populated
yet, it sends out all the mbufs that it collected so far. This
resulted in very poor performance with ZFS when file data is not in the
page cache, because ZFS vop_read for UIO_NOCOPY case populated only
those pages that are already in cache, but not valid. Which means that
most of the time it populated only the first requested page in the
described above scenario.

Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: Alexander Zagrebin <alexz@visp.ru>,
Artemiev Igor <ai@kliksys.ru>
MFC after: 12 days


215397 16-Nov-2010 avg

fix misspelling in a comment

Reported by: Daniel Braniss <danny@cs.huji.ac.il>
MFC after: 3 days


215260 13-Nov-2010 mm

Disable VFS_HOLD placed on mnt_vnodecovered during the mount of a snapshot
and VFS_RELE on a non-existing hold on snapshot parent's z_vfs.

This disables the changes from OpenSolaris onnv-revision 9234:bffdc4fc05c4
(bug IDs: 6792139, 6794830) - not applicable to FreeBSD.

This fixes the process hang if umounting a manually mounted snapshot.

Reported by: Alexander Zagrebin <alexz@visp.ru>
Approved by: delphij (mentor)
MFC after: 1 week


214854 05-Nov-2010 delphij

Validate whether the zfs_cmd_t submitted from userland is not smaller than
what we have. Without the check the kernel could accessing memory that
does not belong to the request struct.

Note that we do not test if the struct equals in size at this time, which
may faciliate forward compatibility with newer binaries.

Reviewed by: pjd at MeetBSD CA '2010
MFC after: 1 week


214378 26-Oct-2010 mm

Bugfix merge from OpenSolaris:

OpenSolaris onnv-revision: 10209:91f47f0e7728
6830541 zfs_get_data_trips on a verify
6696242 multiple zfs_fillpage() zfs: accessing past end of object panics
6785914 zfs fails to drop dn_struct_rwlock in recovery code path

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after: 2 weeks


213937 16-Oct-2010 avg

zfs: add vop_getpages method implementation

This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.

To do: try to satisfy other pages besides the required one taking into
account tradeofs between number of page faults, read throughput and read
latency. Also, eventually vop_putpages should be added too.

Reviewed by: kib, mm, pjd
MFC after: 3 weeks


213790 13-Oct-2010 rpaulo

In zfs_post_common(), use %d instead of %hhu.

Found with: clang


213730 12-Oct-2010 avg

zfs + sendfile: do not produce partially valid pages for vnode's tail

Since r212650 and before this change sendfile(2) could produce
a partially valid page for a trailing portion of a ZFS vnode.
vm_fault() always wants to see a fully valid page even if it's the last
page that partially extends beyond vnode's end. Otherwise it calls
vop_getpages() to bring in the page. In the case of ZFS this means
that the data is read from the page into the same page and this breaks
checks in ZFS mappedread() - a thread that set VPO_BUSY on the page in
vm_fault() will get blocked forever waiting for it to be cleared.

Many thanks to Kai and Jeremy for reproducing the issue and providing
important debugging information and help.

Reported by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Tested by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Reviewed by: kib
MFC after: 3 days
To-Do: apply the same treatment to tmpfs + sendfile


213673 10-Oct-2010 pjd

Provide internal ioflags() function that converts ioflag provided by FreeBSD's
VFS to OpenSolaris-specific ioflag expected by ZFS. Use it for read and write
operations.

Reviewed by: mm
MFC after: 1 week


213634 08-Oct-2010 mm

Change FAPPEND to IO_APPEND as this is a ioflag and not a fflag.
This corrects writing to append-only files on ZFS.

PR: kern/149495 [1], kern/151082 [2]
Submitted by: Daniel Zhelev <daniel@zhelev.biz> [1], Michael Naef <cal@linu.gs> [2]
Approved by: delphij (mentor)
MFC after: 1 week


213198 27-Sep-2010 mm

Properly handle IO with B_FAILFAST
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted

OpenSolaris revision and Bug IDs:

9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843014)
MFC after: 3 weeks


213197 27-Sep-2010 mm

Enable offlining of log devices.

OpenSolaris revision and Bug IDs:

9701:cc5b64682e64
6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after: 3 weeks


212951 21-Sep-2010 avg

zfs_map_page/zfs_unmap_page: do not use sched_pin() and SFB_CPUPRIVATE

zfs_map_page/zfs_unmap_page are mostly called around potential I/O paths
and it seems to be a not very good idea to do cpu pinning there.

Suggested by: kib
MFC after: 2 weeks


212950 21-Sep-2010 avg

zfs_vnops: use zfs_map_page/zfs_unmap_page helper functions in another place

MFC after: 2 weeks


212783 17-Sep-2010 avg

zfs arc_reclaim_needed: fix typo in mismerge in r212780

PR: kern/146410, kern/138790
MFC after: 3 weeks
X-MFC with: r212780


212782 17-Sep-2010 avg

zfs+sendfile: advance uio_offset upon reading as well

Picked from analogous code in tmpfs.

MFC after: 1 week


212781 17-Sep-2010 avg

zfs arc_reclaim_needed: remove redundant checks for arc_c_max and arc_c_max

Those checks are not present in upstream code and they are enforced in
actual calculations of delta by which ARC size can be grown or should be
reduced.

MFC after: 3 weeks


212780 17-Sep-2010 avg

zfs arc_reclaim_needed: more reasonable threshold for available pages

vm_paging_target() is not a trigger of any kind for pageademon, but
rather a "soft" target for it when it's already triggered.
Thus, trying to keep 2048 pages above that level at the expense of ARC
was simply driving ARC size into the ground even with normal memory
loads.
Instead, use a threshold at which a pagedaemon scan is triggered, so
that ARC reclaiming helps with pagedaemon's task, but the latter still
recycles active and inactive pages.

PR: kern/146410, kern/138790
MFC after: 3 weeks


212694 15-Sep-2010 mm

Fix kernel panic when moving a file to .zfs/shares
Fix possible loss of correct error return code in ZFS mount

OpenSolaris revisions and Bug IDs:

11824:53128e5db7cf
6863610 ZFS mount can lose correct error return

12079:13822b941977
6939941 problem with moving files in zfs (142901-12)

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days


212657 15-Sep-2010 avg

zfs vn_has_cached_data: take into account v_object->cache != NULL

This mirrors code in tmpfs.
This changge shouldn't affect much read path, it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages. I believe this situation to be non-essential.

In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.

This change allows us to re-enable vn_has_cached_data() check in zfs_write.

NOTE: strictly speaking resident_page_count and cache fields of v_object
should be exmined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.

Discussed with: alc, kib
Approved by: pjd
Tested with: tools/regression/fsx
MFC after: 3 weeks


212655 15-Sep-2010 avg

zfs mappedread, update_pages: use int for offset and length within a page

uint64_t, int64_t were redundant there

Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212654 15-Sep-2010 avg

zfs mappedread: use uiomove_fromphys where possible

Reviewed by: alc
Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212652 15-Sep-2010 avg

zfs: catch up with vm_page_sleep_if_busy changes

Reviewed by: alc
Approved by: pjd
Tested by: tools/regression/fsx
MFC after: 2 weeks


212650 15-Sep-2010 avg

tmpfs, zfs + sendfile: mark page bits as valid after populating it with data

Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).

PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks


212611 14-Sep-2010 mm

Remove duplicated VFS_HOLD due to a mismerge.

PR: kern/150544
Approved by: delphij (mentor)
MFC after: 1 day


212605 14-Sep-2010 mm

Add missing vop_vector zfsctl_ops_shares
Add missing locks around VOP_READDIR and VOP_GETATTR with z_shares_dir

PR: kern/150544
Approved by: delphij (mentor)
Obtained from: perforce (pjd)
MFC after: 1 day


212573 13-Sep-2010 pjd

Remove the page queues lock around vm_page_undirty() - it is no longer needed.

Reviewed by: alc


212494 12-Sep-2010 rpaulo

Revamp locking a bit. This fixes three problems:
* processes now can't go away while we are inserting probes (fixes a panic)
* if a trap happens, we won't be holding the process lock (fixes a hang)
* fix a LOR between the process lock and the fasttrap bucket list lock

Thanks to kib for pointing some problems.
Sponsored by: The FreeBSD Foundation


212465 11-Sep-2010 rpaulo

Avoid a LOR (sleepable after non-sleepable) in
fasttrap_tracepoint_enable().

Sponsored by: The FreeBSD Foundation


212425 10-Sep-2010 mdf

Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.


212385 09-Sep-2010 pjd

On FreeBSD we can log from pool that have multiple top-level vdevs or log
vdevs, so don't deny adding new vdevs if bootfs property is set.

MFC after: 2 weeks


212357 09-Sep-2010 rpaulo

Fix two bugs in DTrace:
* when the process exits, remove the associated USDT probes
* when the process forks, duplicate the USDT probes.

Sponsored by: The FreeBSD Foundation


212160 02-Sep-2010 gibbs

Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.

o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().

o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.

o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to it's next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.

o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.

sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c
Use an ordered tag when issuing a synchronize cache command.

Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by: Spectra Logic Corporation
MFC after: 1 month


212002 30-Aug-2010 jh

execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)


211948 28-Aug-2010 pjd

Return NULL pointer instead of B_FALSE as it is done in the vendor code.

Obtained from: //depot/user/pjd/zfs/...


211947 28-Aug-2010 pjd

Move ZUT_OBJS in the same place that is used in vendor code.

Obtained from: //depot/user/pjd/zfs/...


211932 28-Aug-2010 mm

Import changes from OpenSolaris that provide
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes

Detailed information (OpenSolaris onnv changesets and Bug IDs):

9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer

9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership

9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters

10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission

10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir

10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)

10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL

10295:f7a18a1e9610
6870564 panic in zfs_getsecattr

Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks


211931 28-Aug-2010 mm

Update ZFS metaslab code from OpenSolaris.
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.

Detailed information (OpenSolaris onnv changesets and Bug IDs):

11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently

11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics

12047:7c1fcc8419ca
6917066 zfs block picking can be improved

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks


211929 28-Aug-2010 rpaulo

Remove debugging.

Sponsored by: The FreeBSD Foundation


211925 28-Aug-2010 rpaulo

Replace a memory barrier with a mutex barrier.

Sponsored by: The FreeBSD Foundation


211900 27-Aug-2010 pjd

Use ZFS_CTLDIR_NAME instead of hardcoding ".zfs".


211855 26-Aug-2010 pjd

Update comment now that I finally committed r211854.

MFC after: 1 month


211762 24-Aug-2010 avg

zfs arc_reclaim_thread: no need to call arc_reclaim_needed when
resetting needfree

needfree is checked at the very start of arc_reclaim_needed.
This change makes code easier to follow and maintain in face of
potential changed in arc_reclaim_needed.

Also, put the whole sub-block under _KERNEL because needfree can be set
only in kernel code.

To do: rename needfree to something else to aovid confusion with
OpenSolaris global variable of the same name which is used in the same
code, but has different meaning (page deficit).

Note: I have an impression that locking around accesses to this variable
as well as mutual notifications between arc_reclaim_thread and
arc_lowmem are not proper.

MFC after: 1 week


211745 24-Aug-2010 rpaulo

Replace a pksignal() call with tdksignal().

Pointed out by: kib


211744 24-Aug-2010 rpaulo

MD fasttrap implementation.

Sponsored by: The FreeBSD Foundation


211738 24-Aug-2010 rpaulo

Port the fasttrap provider to FreeBSD. This provider is responsible for
injecting debugging probes in the userland programs and is the basis for
the pid provider and the usdt provider.

Sponsored by: The FreeBSD Foundation


211618 22-Aug-2010 rpaulo

Port this to FreeBSD. We miss some suword functions, so we use copyout.

Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M sys/fasttrap_impl.h


211608 22-Aug-2010 rpaulo

Kernel DTrace support for:
o uregs (sson@)
o ustack (sson@)
o /dev/dtrace/helper device (needed for USDT probes)

The work done by me was:
Sponsored by: The FreeBSD Foundation


211606 22-Aug-2010 rpaulo

Add the FreeBSD definition for the fasttrap ioctls.

Sponsored by: The FreeBSD Foundation


211555 21-Aug-2010 rpaulo

Port the DTrace helper ioctls to FreeBSD and add a helper member to
dof_helper_t (needed by drti.o).

Sponsored by: The FreeBSD Foundation


211484 19-Aug-2010 imp

First cut at mips n64 ABI support


210999 07-Aug-2010 pjd

In FreeBSD we use 'jailed' property.

MFC after: 2 weeks


210470 25-Jul-2010 mm

Import two changesets from OpenSolaris to make future updates easier.

The changes do not affect FreeBSD code because
zfs_znode_move(), cleanlocks() and cleanshares() are not used.

OpenSolaris onnv changeset: 9788:f660bc44f2e8, 9909:aa280f585a3e

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843700, 6790232)
MFC after: 7 weeks


210457 24-Jul-2010 mm

Consider snapshots as descendants via zfs allow -d

OpenSolaris onnv changeset: 9847:2f3ba86e857a

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6809340)
MFC after: 1 week


210427 23-Jul-2010 avg

zfs arc_memory_throttle: available memory is free + cache

OpenSolaris freemem has the same meaning as our v_free_count +
v_cache_count.

Obtained from: Artem Belevich <fbsdlist@src.cx>,
Peter Jeremy <peterjeremy@acm.org>
Discussed with: pjd
MFC after: 2 weeks


210398 22-Jul-2010 mm

Enable fake resolving of SMB RIDs by using nulldomain and UID_NOBODY
- fixes panics when Solaris/OpenSolaris pools that contain files
uploaded with the SMB protocol are accessed

Enable seting/unsetting the sharesmb property (dummy action)
- allows users who import pools from Solaris/Opensolaris to unset
the sharesmb property and get rid of annoying messages

PR: kern/145778, kern/148709
Approved by: pjd, delphij (mentor)
MFC after: 7 weeks


210282 20-Jul-2010 mm

To improve latency, lower default vfs.zfs.vdev.max_pending from 35 to 10

OpenSolaris onnv changeset (partial): 10801:e0bf032e8673

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6891731)
MFC after: 1 week


210192 17-Jul-2010 nwhitehorn

Increase stack size for ZFS sync thread. This is required to make ZFS
function on 64-bit PowerPC.

Reviewed by: pjd
Obtained from: OpenSolaris changeset 14653:7cf402a7f374


210172 16-Jul-2010 jhb

Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with: attilio


210171 16-Jul-2010 jhb

When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after: 3 days


209962 13-Jul-2010 mm

Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced
in Solaris 10 updates 141445-09 and 142901-14.

Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)

7844:effed23820ae
6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)

7897:e520d8258820
6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)

7965:b795da521357
6740164 zpool attach can create an illegal root pool (141909-02)

8084:b811cc60d650
6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)

8121:7fd09d4ebd9c
6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)

8129:e4f45a0bfbb0
6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)

8188:fd00c0a81e80
6761100 want zdb option to select older uberblocks (141445-01)

8190:6eeea43ced42
6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)

8225:59a9961c2aeb
6737463 panic while trying to write out config file if root pool import fails (141445-01)

8227:f7d7be9b1f56
6765294 Refactor replay (141445-01)

8228:51e9ca9ee3a5
6572357 libzfs should do more to avoid mnttab lookups (141909-01)
6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)

8241:5a60f16123ba
6328632 zpool offline is a bit too conservative (141445-01)
6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698 checksum failures after offline -t / export / import / scrub (141445-01)
6745863 ZFS writes to disk after it has been offlined (141445-01)
6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107 I/O should never suspend during spa_load() (141445-01)
6776548 codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)

8242:e46e4b2f0a03
6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)

8269:03a7e9050cfd
6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194 "zfs get" VALUE column is as wide as NAME (141445-01)
6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518 ASSERT strings shouldn't be pre-processed (141445-01)

8274:846b39508aff
6713916 scrub/resilver needlessly decompress data (141445-01)

8343:655db2375fed
6739553 libzfs_status msgid table is out of sync (141445-01)
6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108 zfs_realloc() should not free original memory on failure (141445-01)

8525:e0e0e525d0f8
6788830 set large value to reservation cause core dump (141445-01)
6791064 want sysevents for ZFS scrub (141445-01)
6791066 need to be able to set cachefile on faulted pools (141445-01)
6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)

8547:bcc7b46e5ff7
6792884 Vista clients cannot access .zfs (141445-01)

8632:36ef517870a3
6798384 It can take a village to raise a zio (141445-01)

8636:7e4ce9158df3
6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)

8692:692d4668b40d
6801507 ZFS read aggregation should not mind the gap (141445-01)

8697:e62d2612c14d
6633095 creating a filesystem with many properties set is slow (141445-01)

8768:dfecfdbb27ed
6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)

8811:f8deccf701cf
6790687 libzfs mnttab caching ignores external changes (141445-01)
6791101 memory leak from libzfs_mnttab_init (141445-01)

8845:91af0d9c0790
6800942 smb_session_create() incorrectly stores IP addresses (N/A)
6582163 Access Control List (ACL) for shares (141445-01)
6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184 Panic at smb_oplock_conflict+0x35() (N/A)

8876:59d2e67b4b65
6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)

8924:5af812f84759
6789318 coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)

9030:243fd360d81f
6815893 hang mounting a dataset after booting into a new boot environment (141445-01)

9056:826e1858a846
6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)

9179:d8fbd96b79b3
6790064 zfs needs to determine uid and gid earlier in create process (141445-01)

9214:8d350e5d04aa
6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)

9229:e3f8b41e5db4
6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)

9230:e4561e3eb1ef
6821169 offlining a device results in checksum errors (141445-01)
6821170 ZFS should not increment error stats for unavailable devices (141445-01)
6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)

9234:bffdc4fc05c4
6792139 recovering from a suspended pool needs some work (141445-01)
6794830 reboot command hangs on a failed zfs pool (141445-01)

9246:67c03c93c071
6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)

9276:a8a7fc849933
6816124 System crash running zpool destroy on broken zpool (141445-03)

9355:09928982c591
6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)

9391:413d0661ef33
6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)

9396:f41cf682d0d3 (part already merged)
6501037 want user/group quotas on ZFS (141445-03)
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)

9404:319573cd93f8
6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)

9412:4aefd8704ce0
6717022 ZFS DMU needs zero-copy support (141445-03)

9425:e7ffacaec3a8
6799895 spa_add_spares() needs to be protected by config lock (141445-03)
6826466 want to post sysevents on hot spare activation (141445-03)
6826468 spa 'allowfaulted' needs some work (141445-03)
6826469 kernel support for storing vdev FRU information (141445-03)
6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471 I/O errors after device remove probe can confuse FMA (141445-03)
6826472 spares should enjoy some of the benefits of cache devices (141445-03)

9443:2a96d8478e95
6833711 gang leaders shouldn't have to be logical (141445-03)

9463:d0bd231c7518
6764124 want zdb to be able to checksum metadata blocks only (141445-03)

9465:8372081b8019
6830237 zfs panic in zfs_groupmember() (141445-03)

9466:1fdfd1fed9c4
6833162 phantom log device in zpool status (141445-03)

9469:4f68f041ddcd
6824968 add ZFS userquota support to rquotad (141445-03)

9470:6d827468d7b5
6834217 godfather I/O should reexecute (141445-03)

9480:fcff33da767f
6596237 Stop looking and start ganging (141909-02)

9493:9933d599bc93
6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)

9512:64cafcbcc337
6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)

9515:d3b739d9d043
6586537 async zio taskqs can block out userland commands (142901-09)

9554:787363635b6a
6836768 zfs_userspace() callback has no way to indicate failure (N/A)

9574:1eb6a6ab2c57
6838062 zfs panics when an error is encountered in space_map_load() (141909-02)

9583:b0696cd037cc
6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)

9630:e25a03f552e0
6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)

9653:a70048a304d1
6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)

9688:127be1845343
6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069 zfs get userused@S-1-... doesn't work (N/A)

9873:8ddc892eca6e
6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)

9904:d260bd3fd47c
6838344 kernel heap corruption detected on zil while stress testing (141445-06)

9951:a4895b3dd543
6844900 zfs_ioc_userspace_upgrade leaks (N/A)

10040:38b25aeeaf7a
6857012 zfs panics on zpool import (141445-06)

10000:241a51d8720c
6848242 zdb -e no longer works as expected (N/A)

10100:4a6965f6bef8
6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)

10160:a45b03783d44
6861983 zfs should use new name <-> SID interfaces (N/A)
6862984 userquota commands can hang (141445-06)

10299:80845694147f
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)

10302:a9e3d1987706
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)

10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)

10800:469478b180d9
6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)

10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)

10810:b6b161a6ae4a
6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)

10890:499786962772
6807339 spurious checksum errors when replacing a vdev (142901-13)

11249:6c30f7dfc97b
6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)

11454:6e69bacc1a5a
6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)

11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)

Discussed with: pjd
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 months


209721 06-Jul-2010 rpaulo

Merge from vendor-sys/opensolaris:
* add fasttrap files


209275 17-Jun-2010 mm

Import latest ARC change from OpenSolaris:
- large ghost eviction causes high write latency
- arc_adjust might adjust MRU unnecessarily
- arc_adapt can lead to wild arc_p adjustment

OpenSolaris onnv-revision: 12636:13b5d698941e

Submitted by: avg
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6950219, 6953403, 6951024)
MFC after: 1 month


209261 17-Jun-2010 pjd

Turn off UMA allocations on all archs by default. It isn't stable even on
amd64.

Reported by: many
MFC after: 3 days


209230 16-Jun-2010 pjd

Remove redundant assignment.

MFC after: 3 days


209101 12-Jun-2010 mm

Fix arc_read_done may try to byteswap undefined data (sparc related)

OpenSolaris onnv-revision: 10839:cf83b553a2ab

Obtained from: OpenSolaris (Bug ID 6836714)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209100 12-Jun-2010 mm

Fix panic in zfs_getsecattr

OpenSolaris onnv-revision: 10295:f7a18a1e9610

Obtained from: OpenSolaris (Bug ID 6870564)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209099 12-Jun-2010 mm

Fix possible zfs panic on zpool import

OpenSolaris onnv-revision: 10040:38b25aeeaf7a

Obtained from: OpenSolaris (Bug ID 6857012)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209098 12-Jun-2010 mm

Fix zpool resilver stalls with spa_scrub_thread in a 3 way deadlock

OpenSolaris onnv-revision: 9997:174d75a29a1c

Obtained from: OpenSolaris (Bug ID 6843235)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209097 12-Jun-2010 mm

Fix ZFS panic deadlock: cycle in blocking chain via zfs_zget

OpenSolaris onnv-revision: 9774:0bb234ab2287

Obtained from: OpenSolaris (Bug ID 6788152)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209096 12-Jun-2010 mm

Fix vdev_probe() starvation brings txg train to a screeching halt

OpenSolaris onnv-revision: 9722:e3866bad4e96

Obtained from: OpenSolaris (Bug ID 6844069)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209095 12-Jun-2010 mm

Fix incomplete resilvering after disk replacement (raidz)

OpenSolaris onnv-revision: 9434:3bebded7c76a

Obtained from: OpenSolaris (Bug ID 6794570)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209094 12-Jun-2010 mm

Fix zfs destroy fails to free object in open context, stops up txg train

OpenSolaris onnv-revision: 9409:9dc3f17354ed

Obtained from: OpenSolaris (Bug ID 6809683)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209093 12-Jun-2010 mm

Fix unable to remove a file over NFS after hitting refquota limit

OpenSolaris onnv-revision: 8890:8c2bd5f17bf2

Obtained from: OpenSolaris (Bug ID 6798878)
Approved by: pjd, delphij (mentor)
MFC after: 3 days


209059 11-Jun-2010 jhb

Update several places that iterate over CPUs to use CPU_FOREACH().


208775 03-Jun-2010 mm

Fix freeing space after deleting large files with holes.

OpenSolaris onnv revision: 9950:78fc41aa9bc5

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6792701)
MFC after: 3 days


208689 01-Jun-2010 mm

Fix ZIL close when doing zfs rollback or zfs receive on a mounted dataset.

The fix is a partial import and merge of OpenSolaris onnv revisions
8227:f7d7be9b1f56. and 9292:e112194b5b73

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6798298)
MFC after: 3 days


208683 31-May-2010 pjd

Fix a bug where resilver is not started automatically on pool import or load.
If disk was missing on pool load or import and on next pool load or import
it was present, resilver wasn't started automatically and ZFS reported all disks
as ONLINE and healthy. Then, when another disk died, pool became unaccessible,
because if it was 2-way mirror or RAIDZ1 two vdevs were out of sync.

To fix the problem, start resilver automatically on pool load or import.

Obtained from: OpenSolaris
MFC after: 3 days


208682 31-May-2010 pjd

Fix panic when reading label from provider with non power of 2 sector size.

Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
MFC after: 3 days


208474 23-May-2010 mm

Remove kstat.zfs.arcstats.l2_write_bytes_written

The arcstats.l2_write_bytes_written kstat counter introduced
in r205231 was duplicite with vendor's arcstats.l2_write_bytes counter
imported in r208373 (OpenSolaris revision 8582:df9361868dbe)

Approved by: pjd, delphij (mentor)
MFC after: 3 days


208472 23-May-2010 mm

Fix zfs receive temporarily changing unchanged stream properties.
Fix possible panic with zfs_enable_datasets.

OpenSolaris onnv revision: 8536:33bd5de3260e

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6748561, 6757075)
MFC after: 3 days


208458 23-May-2010 pjd

Create UMA zones unconditionally.

MFC after: 3 days


208454 23-May-2010 pjd

Remove ZIO_USE_UMA from arc.c as well.

MFC after: 3 days


208443 23-May-2010 mm

Fix kernel panic when calling spa_tryimport() on a corrupted pool.

OpenSolaris onnv revision: 8680:005fe27123ba

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6786321)
MFC after: 1 day


208442 23-May-2010 mm

Fix mutex_exit misorder that can cause a kernel panic.

OpenSolaris onnv revision: 8667:5c308a17eb7c

Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6795440)
MFC after: 1 day


208373 21-May-2010 mm

Update L2ARC code and fix several bugs.

- improve ARC memory consumption (Bug ID 6488341)
- ARC/L2ARC metadata accounting (Bug ID 6748019)
- L2ARC turbo warmup (Bud ID 6748023)
- kstats for ARC content (Bug ID 6748023)
- kstats for evicted bytes from ARC by L2ARC state (Bud ID 6871680)
- fix panic on i386 systems (Bug ID 6821260)

OpenSolaris onnv revisions:
8582:df9361868dbe, 8628:97dcded6e556, 9215:7c4584f76b47,
9274:a10f8bd993c1, 10357:29060492b29d

OpenSolaris Bug IDs:
6748019, 6748023, 6748030, 6488341, 6798268, 6821260, 6790261, 6871680

Approved by: pjd, delphij (mentor)
Obtained from: OpenSlaris (multiple bug IDs)
MFC after: 3 days


208372 21-May-2010 mm

Reorder some already introduced locking variables.

OpenSolaris onnv revision: 8214:d7abf7c1f1c1

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6747934)
MFC after: 3 days


208371 21-May-2010 mm

Fix stack overflow in zfs send.

OpenSolaris onnv-revision: 8012:8ea30813950f

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6765626)
MFC after: 3 days


208370 21-May-2010 mm

Fix: vdev_reopen() can lead to failed allocations

OpenSolaris onnv-revision: 7980:589f37f25048

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6764914)
MFC after: 3 days


208166 16-May-2010 pjd

Fix userland build by making io_task available only for the kernel and by
providing taskq_dispatch_safe() macro.

MFC after: 1 week


208148 16-May-2010 pjd

Allow to configure UMA usage for ZIO data via loader and turn it on by
default for amd64. On i386 I saw performance degradation when UMA was used,
but for amd64 it should help.

MFC after: 3 days


208147 16-May-2010 pjd

Add task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.

Prodded by: kib
MFC after: 1 week


208142 16-May-2010 pjd

The whole point of having dedicated worker thread for each leaf VDEV was to
avoid calling zio_interrupt() from geom_up thread context. It turns out that
when provider is forcibly removed from the system and we kill worker thread
there can still be some ZIOs pending. To complete pending ZIOs when there is
no worker thread anymore we still have to call zio_interrupt() from geom_up
context. To avoid this race just remove use of worker threads altogether.
This should be more or less fine, because I also thought that zio_interrupt()
does more work, but it only makes small UMA allocation with M_WAITOK.
It also saves one context switch per I/O request.

PR: kern/145339
Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com>
MFC after: 1 week


208131 16-May-2010 mm

Fix deadlock between zfs_dirent_lock and zfs_rmdir

OpenSolaris onnv revision: 11321:506b7043a14c

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6847615)
MFC after: 3 days


208130 16-May-2010 mm

Fix perfomance problem with ZFS prefetch caching [1]
Add statistics for ZFS prefetch (sysctl kstat.zfs.misc.zfetchstats)

Partial import of OpenSolaris onnv revision 10474:0e96dd3b905a

Reported by: jhell@dataix.net (private e-mail) [1]
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6859997, 6868951)
MFC after: 3 days


208050 13-May-2010 mm

Fix ZIL-related panic on zfs rollback.

OpenSolaris onnv-revision: 8746:e1d96ca6808c

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6796377)
MCF after: 1 week


208047 13-May-2010 mm

Import OpenSolaris revision 7837:001de5627df3
It includes the following changes:
- parallel reads in traversal code (Bug ID 6333409)
- faster traversal for zfs send (Bug ID 6418042)
- traversal code cleanup (Bug ID 6725675)
- fix for two scrub related bugs (Bug ID 6729696, 6730101)
- fix assertion in dbuf_verify (Bug ID 6752226)
- fix panic during zfs send with i/o errors (Bug ID 6577985)
- replace P2CROSS with P2BOUNDARY (Bug ID 6725680)

List of OpenSolaris Bug IDs:
6333409, 6418042, 6757112, 6725668, 6725675, 6725680,
6725698, 6729696, 6730101, 6752226, 6577985, 6755042

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 1 week


208030 13-May-2010 trasz

Add missing check to prevent local users from panicing the kernel by trying
to set malformed ACL.

MFC after: 3 days


207956 12-May-2010 mm

Fix possible hang when replaying large truncations.

OpenSolaris onnv revision: 7904:6a124a4ca9c5

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6761624)
MFC after: 3 days


207936 11-May-2010 pjd

Eventhough r203504 eliminates taste traffic provoked by vdev_geom.c,
ZFS still like to open all vdevs, close them and open them again,
which in turn provokes taste traffic anyway.

I don't know of any clean way to fix it, so do it the hard way - if we can't
open provider for writing just retry 5 times with 0.5 pauses. This should
elimitate accidental races caused by other classes tasting providers created on
top of our vdevs.

MFC after: 3 days
Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
Reported by: Yuri Pankov <yuri.pankov@gmail.com>


207934 11-May-2010 pjd

Add missing new line characters to the warnings.

MFC after: 3 days


207911 11-May-2010 mm

Fix failed assertion on destroying datasets from an older pool version.

OpenSolaris onnv revision: 9390:887948510f80

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826861)
MFC after: 3 days


207910 11-May-2010 mm

Fix possible panic with zfs destroy.

OpenSolaris onnv revision: 8779:f164e0e90508

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6784924)
MFC after: 3 days


207909 11-May-2010 mm

Fix zfs rename (may occasionally fail with dataset busy).

OpenSolaris onnv revision: 8517:41a0783dde17

PR: kern/146471
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6784757)
MFC after: 3 days


207908 11-May-2010 mm

Fix endianess bug in ZFS intent log (ZIL).

OpenSolaris onnv revision: 8109:6147a1bdd359

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6760048)
MFC after: 3 days


207745 07-May-2010 trasz

Enforce RLIMIT_FSIZE in ZFS.

Reviewed by: pjd@


207683 05-May-2010 marius

- Fix broken symlinks on cross platform zfs send/recv. [1]
- Enable zfs_ace_byteswap() on FreeBSD as it works just fine (tested between
amd64 and sparc64 in both directions by Michael Moll).

PR: 146272
Approved by: mm, pjd
Obtained from: OpenSolaris (onnv rev. 8283:1ca59f393041; Bug ID 6764193) [1]
MFC after: 3 days


207670 05-May-2010 mm

Introduce hardforce export option (-F) for "zpool export".
When exporting with this flag, zpool.cache remains untouched.

OpenSolaris onnv revision: 8211:32722be6ad3b

Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID: 6775357)


207626 04-May-2010 mm

Speed up ZFS list operation with objset prefetching.

Partial import of OpenSolaris onnv revisions:
8415:8809e849f63e, 10474:0e96dd3b905a

PR: kern/146297
Submitted by: myself
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6386929, 6755389, 6847118)
MFC after: 2 weeks


207624 04-May-2010 mm

Fix deadlock during zfs receive.

OpenSolaris onnv revision: 9299:8809e849f63e

PR: kern/146296
Submitted by: myself
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6783818, 6826836)
MFC after: 1 week


207481 01-May-2010 mm

Add sysctl and loader tunable vfs.zfs.txg.write_limit_override.
This tunable improves fine-tuning of ZFS write throttling.

PR: kern/146108
Suggested by: Nikolay Denev <ndenev at gmail.com>
Approved by: pjd, delphij (mentor)
MFC after: 2 weeks


207480 01-May-2010 mm

Change description of tunable group vfs.zfs.txg to be more
understandable.

Approved by: pjd, delphij (mentor)
MFC after: 3 days


207427 30-Apr-2010 mm

Fix improper pool write throughput calculation.

OpenSolaris onnv revision: 9366:17553395a745

PR: kern/146108
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris, Bug ID 6817339
MFC after: 2 weeks


207334 28-Apr-2010 pjd

Backport fix for 'zfs_znode_dmu_init: existing znode for dbuf' panic from OpenSolaris.

PR: kern/144402
Reported by: Alex Bakhtin <alex.bakhtin@gmail.com>
Tested by: Alex Bakhtin <alex.bakhtin@gmail.com>
Obtained from: OpenSolaris, Bug ID 6895088
MFC after: 3 days


207068 22-Apr-2010 pjd

Allow to modify directory's content even if the ZFS_NOUNLINK (SF_NOUNLINK,
sunlnk) flag is set. We only deny dirctory's removal or rename.

PR: kern/143343
Reported by: marck
MFC after: 3 days


206797 18-Apr-2010 pjd

Restore previous order.


206796 18-Apr-2010 pjd

Style fixes.


206795 18-Apr-2010 pjd

Add missing list and lock destruction.


206794 18-Apr-2010 pjd

Extend locks scope to match OpenSolaris.


206793 18-Apr-2010 pjd

Remove racy assertion.

Obtained from: OpenSolaris


206792 18-Apr-2010 pjd

Set ARC_L2_WRITING on L2ARC header creation.

Obtained from: OpenSolaris


206667 15-Apr-2010 pjd

Fix 3-way deadlock that can happen because of ZFS and vnode lock
order reversal.

thread0 (vfs_fhtovp) thread1 (vop_getattr) thread2 (zfs_recv)
-------------------- --------------------- ------------------
vn_lock
rrw_enter_read
rrw_enter_write (hangs)
rrw_enter_read (hangs)
vn_lock (hangs)

Submitted by: Attila Nagy <bra@fsn.hu>
MFC after: 3 days


205346 19-Mar-2010 pjd

The same code is used to import and to create pool.
The order of operations is the following:
1. Try to open vdev by remembered path and guid.
2. If 1 failed, try to find vdev which guid matches and ignore the path.
3. If 2 failed this means either that the vdev we're looking for is gone
or that pool is being created and vdev doesn't contain proper guid yet.
To be able to handle pool creation we open vdev by path anyway.

Because of 3 it is possible that we open wrong vdev on import which can lead to
confusions.

The solution for this is to check spa_load_state. On pool creation it will be
equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it
is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails,
we open by guid. We no longer open wrong vdev on import.

MFC after: 2 weeks


205264 17-Mar-2010 kmacy

- cache line align arcs_lock array (h/t Marius Nuennerich)
- fix ARCS_LOCK_PAD to use architecture defined CACHE_LINE_SIZE
- cache line align buf_hash_table ht_locks array

MFC after: 7 days


205253 17-Mar-2010 kmacy

use CACHE_LINE_SIZE instead of hardcoding 128 for lock pad

pointed out by Marius Nuennerich and jhb@


205231 16-Mar-2010 kmacy

- reduce contention by breaking up ARC state locks in to 16 for data
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)


205133 13-Mar-2010 kmacy

fix compilation under ZIO_USE_UMA


205132 13-Mar-2010 kmacy

Don't bottleneck on acquiring the stream locks - this avoids a massive
drop off in throughput with large numbers of simultaneous reads

MFC after: 7 days


205079 12-Mar-2010 pjd

Remove bogus assertion.

Reported by: Johan Ström <johan@stromnet.se>
Obtained from: OpenSolaris, Bug ID 6827260
MFC after: 1 week


204804 06-Mar-2010 pjd

Remove racy assertion.

Reported by: Attila Nagy <bra@fsn.hu>
Obtained from: OpenSolaris, Bug ID 6827260
MFC after: 1 week


204101 19-Feb-2010 pjd

Don't set f_bsize to recordsize. It might confuse some software (like squid).

Submitted by: Alexander Zagrebin <alexz@visp.ru>
MFC after: 2 weeks


204073 18-Feb-2010 pjd

Add tunable and sysctl to skip hostid check on pool import.


203533 05-Feb-2010 delphij

Remove two files that are not needed by FreeBSD.

Approved by: pjd
MFC after: 2 weeks


203504 04-Feb-2010 pjd

Open provider for writting when we find the right one. Opening too much
providers for writing provokes huge traffic related to taste events send
by GEOM on close. This can lead to various problems with opening GEOM
providers that are created on top of other GEOM providers.

Reorted by: Kurt Touet <ktouet@gmail.com>, mr
Tested by: mr, Baginski Darren <kickbsd@ya.ru>
MFC after: 2 weeks


202129 11-Jan-2010 delphij

Report ZFS filesystem version instead of the zpool version when we say it.

Reported by: Yuri Pankov (on -fs@)
Submitted by: delphij
Approved by: pjd
MFC after: 1 week


201756 07-Jan-2010 delphij

Re-apply onnv-gate revisions 7994 and 8986 (corresponds to FreeBSD
revision 200726 and 200727). It looks like that the two revisions
were not applied in the right sequence, I found this when comparing
with the OpenSolaris code.

MFC after: 3 days
Reviewed by: mm@


201406 02-Jan-2010 delphij

Reduce diff against OpenSolaris - move Giant acquire/release to
zfs_znode.c. As a side effect this also eliminates two potential
Giant leaks.

Approved by: pjd
MFC after: 1 month


201143 28-Dec-2009 delphij

Apply OpenSolaris revision 8012 which brings our zpool to version 14,
making it possible for zpools created on OpenSolaris 2009.06 be used
on FreeBSD.

PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks


200727 19-Dec-2009 delphij

Apply fix for Solaris bug 6462803: zfs snapshot -r failed because
filesystem was busy
(onnv revision 8989)

Submitted by: mm
Approved by: pjd
Obtained from: OpenSolaris
MFC after: 2 weeks


200726 19-Dec-2009 delphij

Apply fix for Solaris bug 6801979: zfs recv can fail with E2BIG
(onnv revision 8986)

Requested by: mm
Submitted by: pjd
Obtained from: OpenSolaris
MFC after: 2 weeks


200724 19-Dec-2009 delphij

Apply fix Solaris bug 6462803 zfs snapshot -r failed because
filesystem was busy.

Submitted by: mm
Approved by: pjd
MFC after: 2 weeks


200162 05-Dec-2009 kib

Change VOP_FSYNC for zfs vnode from VOP_PANIC to zfs_freebsd_fsync(),
both to not panic when fsync(2) is called for fifo on zfs
filedescriptor, and to actually fsync fifo inode to permanent storage.

PR: kern/141177
Reviewed by: pjd
MFC after: 1 week


200158 05-Dec-2009 pjd

We have to eventually look for provider without checking guid as this is need
for attaching when there is no metadata yet.

Before r200125 the order of looking for providers was wrong. It was:
1. Find provider by name.
2. Find provider by guid.
3. Find provider by name and guid.

Where it should have been:
1. Find provider by name and guid.
2. Find provider by guid.
3. Find provider by name.

MFC after: 1 week


200126 05-Dec-2009 pjd

Fix deadlock when ZVOLs are present and we are replacing dead component or
calling scrub when pool is in a degraded state. It will try to taste ZVOLs,
which will lead to deadlock, as ZVOL will try to acquire the same locks as
replace/scrub is holding already.

We can't simply skip provider based on their GEOM class, because ZVOL can have
providers build on top of it and we need to skip those as well.

We do it by asking for ZFS::iszvol attribute. Any ZVOL-based provider will give
us positive answer and we have to skip those providers.

This way we remove possibility to create ZFS pools on top of ZVOLs, but it is
not very useful anyway.

I believe deadlock is still possible in some very complex situations like when
we have MD provider on top of UFS file on top of ZVOL. When we try to replace
dead component in the pool mentioned ZVOL is based on, there might be a
deadlock when ZFS will try to taste MD provider. There is no easy way to detect
that, but it isn't very common.

MFC after: 1 week


200125 05-Dec-2009 pjd

Always check guid when opening by path, because we may end up with provider
that does have the same name, but only by accident.

MFC after: 1 week


200124 05-Dec-2009 pjd

Avoid using additional variable for storing an error if we are not going
to do anything with it.


199157 10-Nov-2009 pjd

Be careful which vattr fields are set during setattr replay.
Without this fix strange things can appear after unclean shutdown like
files with mode set to 07777.

Reported by: des
MFC after: 3 days


199156 10-Nov-2009 pjd

Avoid passing invalid mountpoint to getnewvnode().

Reported by: rwatson
Tested by: rwatson
MFC after: 3 days


198703 30-Oct-2009 pjd

- zfs_zaccess() can handle VAPPEND too, so map V_APPEND to VAPPEND and call
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely) call both
zfs_access() and vaccess() after spliting V* flags.

This fixes "dirtying snapshot!" panic.

PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
MFC after: 3 days


197861 08-Oct-2009 pjd

Allow file system owner to modify system flags if securelevel permits.

MFC after: 3 days


197843 07-Oct-2009 pjd

On FreeBSD it is enough to report provider removal when orphan event is
received, we don't have to do it on every ENXIO error in I/O path.
Solaris has no GEOM so they have to handle it in a less clean way.

MFC after: 3 days


197842 07-Oct-2009 pjd

Fix white-spaces.

MFC after: 3 days


197831 07-Oct-2009 pjd

Fix situation where Mac OS X NFS client creates a file and when it tries
to set ownership and mode in the same setattr operation, the mode was
overwritten by secpolicy_vnode_setattr().

PR: kern/118320
Submitted by: Mark Thompson <info-gentoo@mark.thompson.bz>
MFC after: 3 days


197816 06-Oct-2009 kmacy

Prevent paging pressure from draining arc too much
- always drain arc if above arc_c_max - never drain arc if arc is below arc_c_max

MFC after: 3 days


197683 01-Oct-2009 delphij

Return EOPNOTSUPP instead of EINVAL when doing chflags(2) over an old
format ZFS, as defined in the manual page.

Submitted by: pjd (response of my original patch but bugs are mine)
MFC after: 3 days


197515 26-Sep-2009 pjd

Handle cases where virtual (GFS) vnodes are referenced when doing forced
unmount. In that case we cannot depend on the proper order of invalidating
vnodes, so we have to free resources when we have a chance.

PR: kern/139062
Reported by: trasz
MFC after: 3 days


197514 26-Sep-2009 pjd

On lookup error VFS expects *vpp to be set to NULL, be sure to do that.

MFC after: 3 days


197513 26-Sep-2009 pjd

Use traverse() function to find and return mount point's vnode instead of
covered vnode when snapshot is already mounted.

MFC after: 3 days


197512 26-Sep-2009 pjd

- Don't depend on value returned by gfs_*_inactive(), it doesn't work
well with forced unmounts when GFS vnodes are referenced.
- Make other preparations to GFS for forced unmounts.

PR: kern/139062
Reported by: trasz
MFC after: 3 days


197497 25-Sep-2009 pjd

Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to
be a bit weak and OpenSolaris also switched to fletcher4.

PR: kern/139072
Reported by: Daniel Grund <bugs@dgrund.de>
MFC after: 3 days


197459 24-Sep-2009 pjd

Before calling vflush(FORCECLOSE) mark file system as unmounted so the
following vnops will fail. This is very important, because without this change
vnode could be reclaimed at any point, even if we increased usecount. The only
way to ensure that vnode won't be reclaimed was to lock it, which would be very
hard to do in ZFS without changing a lot of code. With this change simply
increasing usecount is enough to be sure vnode won't be reclaimed from under
us. To be precise it can still be reclaimed but we won't be able to see it,
because every try to enter ZFS through VFS will result in EIO.

The only function that cannot return EIO, because it is needed for vflush() is
zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.

MFC after: 3 days


197458 24-Sep-2009 pjd

Close race in zfs_zget(). We have to increase usecount first and then
check for VI_DOOMED flag. Before this change vnode could be reclaimed
between checking for the flag and increasing usecount.

MFC after: 3 days


197435 23-Sep-2009 trasz

In VOP_SETACL(9) and VOP_GETACL(9), specifying wrong ACL type should result
in EINVAL, not EOPNOTSUPP.


197426 23-Sep-2009 pjd

Restore BSD behaviour - when creating new directory entry use parent directory
gid to set group ownership and not process gid.

This was overlooked during v6 -> v13 switch.

PR: kern/139076
Reported by: Sean Winn <sean@gothic.net.au>
MFC after: 3 days


197351 20-Sep-2009 pjd

Purge namecache in the same place OpenSolaris does.


197289 17-Sep-2009 pjd

Purge file system namecache when receiving incremental stream and rolling back
to it.

MFC after: 3 days


197287 17-Sep-2009 pjd

Purge namecache for the file system being rolled back, so it doesn't point at
invalid vnodes after the rollback resulting in EIO errors when trying to access
files which are in the namecache.

Reported by: des
MFC after: 3 days


197219 15-Sep-2009 pjd

Forced unmounts work just fine in my tests under heavy load. There might
still be a problem, but it isn't worth a warning.


197218 15-Sep-2009 pjd

We believe ZFS is ready for production use. Remove a warning about it being
experimental. :)


197201 14-Sep-2009 pjd

- Mount ZFS snapshots with MNT_IGNORE flag, so they are not visible in regular
df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
ZFS route of not listing snapshots by default with 'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.

Reviewed by: kib
MFC after: 3 days


197177 13-Sep-2009 pjd

Support both case: when snapshot is already mounted and when it is not yet
mounted.

MFC after: 3 days


197172 13-Sep-2009 pjd

Add missing \n.

Reported by: marck


197167 13-Sep-2009 pjd

Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.

Note that converting inode number to snapshot's vnode is expensive operation.
Snapshots are stored in AVL tree, but based on their names, not inode numbers,
so to convert inode to snapshot vnode we have to interate over all snalshots.

This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.

PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 3 days


197153 13-Sep-2009 pjd

When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.

MFC after: 3 days


197152 13-Sep-2009 pjd

Extend scope of the z_teardown_lock lock for consistency and "just in case".

MFC after: 3 days


197151 13-Sep-2009 pjd

Be sure not to overflow struct fid.

MFC after: 3 days


197150 13-Sep-2009 pjd

There is a bug where mze_insert() can trigger an assert() of inserting
the same entry twice. This bug is not fixed yet, but leads to situation
where when try to access corrupted directory the kernel will panic.
Until the bug is properly fixed, try to recover from it and log that it
happened.

Reported by: marck
OpenSolaris bug: 6709336
MFC after: 3 days


197133 12-Sep-2009 pjd

- Protect reclaim with z_teardown_inactive_lock.
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
z_dbuf field is NULL - this might happen in case of rollback or forced
unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
done asynchronously via zfs_reclaim_complete().

MFC after: 1 week


197131 12-Sep-2009 pjd

Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
NULL, but also can point to dead vnode, take that into account.

PR: kern/132068
Reported by: Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 1 week


196985 08-Sep-2009 pjd

Only log successful commands! Without this fix we log even unsuccessful
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.

Reported by: Denis Ahrens <denis@h3q.com>
MFC after: 3 days


196982 08-Sep-2009 pjd

We don't export individual snapshots, so mnt_export field in snapshot's
mount point is NULL. That's why when we try to access snapshots over NFS
use mnt_export field from the parent file system.

MFC after: 1 week


196980 08-Sep-2009 pjd

When we automatically mount snapshot we want to return vnode of the mount point
from the lookup and not covered vnode. This is one of the fixes for using .zfs/
over NFS.

MFC after: 1 week


196979 08-Sep-2009 pjd

On FreeBSD we don't have to look for snapshot's mount point,
because fhtovp method is already called with proper mount point.

MFC after: 1 week


196978 08-Sep-2009 pjd

Call ZFS_EXIT() after locking the vnode.

MFC after: 1 week


196965 08-Sep-2009 pjd

Fix reference count leak for a case where snapshot's mount point is updated.
Such situation is not supported.

This problem was triggered by something like this:

# zpool create tank da0
# zfs snapshot tank@snap
# cd /tank/.zfs/snapshot/snap (this will mount the snapshot)
# cd
# mount -u nosuid /tank/.zfs/snapshot/snap (refcount leak)
# zpool export tank
cannot export 'tank': pool is busy

MFC after: 1 week


196954 07-Sep-2009 pjd

If we have to use avl_find(), optimize a bit and use avl_insert() instead of
avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).

Fix similar case in the code that is currently commented out.


196953 07-Sep-2009 pjd

When snapshot mount point is busy (for example we are still in it)
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.

This fixes a panic reproducable in the following steps:

# zfs create tank/foo
# zfs snapshot tank/foo@snap
# cd /tank/foo/.zfs/snapshot/snap
# umount /tank/foo
panic: avl_find() succeeded inside avl_add()

Reported by: trasz
MFC after: 3 days


196949 07-Sep-2009 trasz

Enable NFSv4 ACL support in ZFS.

Reviewed by: pjd


196944 07-Sep-2009 pjd

Don't recheck ownership on update mount. This will eliminate LOR between
vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.

Noticed by: kib
Reviewed by: kib
MFC after: 1 week


196941 07-Sep-2009 trasz

Prevent the line from wrapping.


196927 07-Sep-2009 pjd

Changing provider size is not really supported by GEOM, but doing so when
provider is closed should be ok.

When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.

PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
MFC after: 1 week


196919 07-Sep-2009 pjd

bzero() on-stack argument, so mutex_init() won't misinterpret that the
lock is already initialized if we have some garbage on the stack.

PR: kern/135480
Reported by: Emil Mikulic <emikulic@gmail.com>
MFC after: 3 days


196863 05-Sep-2009 trasz

Improve wording.

Discussed with: pjd, cperciva, rink, wkoszek and des, in order of appearance.


196703 31-Aug-2009 pjd

Backport the 'dirtying dbuf' panic fix from newer ZFS version.

Reported by: Thomas Backman <serenity@exscape.org>
MFC after: 1 week


196702 31-Aug-2009 pjd

Remove empty directory.


196662 30-Aug-2009 pjd

Add missing mountpoint vnode locking.

This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
regular user tries to mount dataset owned by him.

MFC after: 1 week


196458 23-Aug-2009 pjd

- Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
'vdev:worker da0' -> 'vdev da0'


196457 23-Aug-2009 pjd

Set priority of vdev_geom threads and zvol threads to PRIBIO.


196309 17-Aug-2009 pjd

getcwd() (when __getcwd() fails) works by stating current directory, going up
(..), calling readdir and looking for previous directory inode. In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.

Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.

This fixes /bin/pwd from within .zfs/snapshot/<name>/.

Suggested by: kib
Approved by: re (rwatson)


196307 17-Aug-2009 pjd

Manage asynchronous vnode release just like Solaris.

Discussed with: kmacy
Approved by: re (kib)


196303 17-Aug-2009 pjd

- Reduce z_teardown_lock lock scope a bit.
- The error variable is int, not bool.
- Convert spaces to tabs where needed.

Approved by: re (kib)


196301 17-Aug-2009 pjd

If z_buf is NULL, we should free znode immediately.

Noticed by: avg
Approved by: re (kib)


196299 17-Aug-2009 pjd

- We need to recycle vnode instead of freeing znode.

Submitted by: avg

- Add missing vnode interlock unlock.
- Remove redundant znode locking.

Approved by: re (kib)


196297 17-Aug-2009 pjd

Fix panic in zfs recv code. The last vnode (mountpoint's vnode) can have
0 usecount.

Reported by: Thomas Backman <serenity@exscape.org>
Approved by: re (kib)


196295 17-Aug-2009 pjd

Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by: re (kib)


196291 17-Aug-2009 pjd

- Fix a race where /dev/zfs control device is created before ZFS is fully
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
where one of the zpool/zfs command loads zfs.ko and tries to do some work
immediately, but /dev/zfs is not there yet.

Reported by: pav
Approved by: re (kib)


196289 17-Aug-2009 pjd

Remove files that are no longer used.

Discussed with: kmacy
Approved by: re (kib)


195909 27-Jul-2009 pjd

We don't support ephemeral IDs in FreeBSD and without this fix ZFS can
panic when in zfs_fuid_create_cred() when userid is negative. It is
converted to unsigned value which makes IS_EPHEMERAL() macro to
incorrectly report that this is ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.

PR: kern/132337
Submitted by: Matthew West <freebsd@r.zeeb.org>
Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by: re (kib)
MFC after: 2 weeks


195822 22-Jul-2009 trasz

Fix extattr_list_file(2) on ZFS in case the attribute directory
doesn't exist and user doesn't have write access to the file.
Without this fix, it returns bogus value instead of 0. For some
reason this didn't manifest on my kernel compiled with -O0.

PR: kern/136601
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Approved by: re (kib)


195785 20-Jul-2009 trasz

Fix permission handling for extended attributes in ZFS. Without
this change, ZFS uses SunOS Alternate Data Streams semantics - each
EA has its own permissions, which are set at EA creation time
and - unlike SunOS - invisible to the user and impossible to change.
From the user point of view, it's just broken: sometimes access
is granted when it shouldn't be, sometimes it's denied when
it shouldn't be.

This patch makes it behave just like UFS, i.e. depend on current
file permissions. Also, it fixes returned error codes (ENOATTR
instead of ENOENT) and makes listextattr(2) return 0 instead
of EPERM where there is no EA directory (i.e. the file never had
any EA).

Reviewed by: pjd (idea, not actual code)
Approved by: re (kib)


194586 21-Jun-2009 kib

Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson


194300 16-Jun-2009 jhb

Remove confusing mergeinfo caused by renaming files.


194118 13-Jun-2009 jamie

Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by: bz (mentor)


194043 11-Jun-2009 kmacy

pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous
differences with Opensolaris' ZFS

Sorry for the churn


193980 11-Jun-2009 kmacy

check against prefetch_enable


193953 10-Jun-2009 kmacy

use default policy for enabling prefetching unless the TUNABLE is set


193878 10-Jun-2009 kmacy

As far as I can tell systems that have less than 4GB are more often hurt
by prefetched than helped. On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.


193440 04-Jun-2009 ps

Support shared vnode locks for write operations when the offset is
provided on filesystems that support it. This really improves mysql
+ innodb performance on ZFS.

Reviewed by: jhb, kmacy, jeffr


193163 31-May-2009 dfr

Allow the bootfs property to be set for raidz pools on FreeBSD.

Reviewed by: pjd


193128 30-May-2009 kmacy

fix xdrmem_control to be safe in an if statement
fix zfs to depend on krpc
remove xdr from zfs makefile

Submitted by: dchagin@freebsd.org


193110 30-May-2009 kmacy

work around snapshot shutdown race reported by Henri Hennebert


192971 28-May-2009 kmacy

MFdevbranch 192944
- add FreeBSD implementation of xdrmem_control needed by zfs
- have zfs define xdr_ops using FreeBSD's definition
- remove solaris xdr files from zfs compile


192853 26-May-2009 sson

Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by: attilio jhb jb
Approved by: gnn (mentor)


192800 26-May-2009 trasz

MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by: pjd


192689 24-May-2009 trasz

Fix comment.


192360 19-May-2009 kmacy

- back out direct map hack
- it is no longer needed


192237 17-May-2009 kmacy

SAVESTART implies SAVENAME


192211 16-May-2009 kmacy

- allow forced unmounts
- don't assume snapshot was auto-mounted


192209 16-May-2009 kmacy

only use direct map if system has more than 2GB


192207 16-May-2009 kmacy

apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map


191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


191984 11-May-2009 kmacy

rename xdr support files to avoid conflicts when linking in to the kernel


191931 09-May-2009 kmacy

- rename atomic.S and crc32.c to avoid collisions when linking zfs in to the kernel
- update Makefile
- ifdef out acl_{alloc, free}, they aren't used by zfs and conflict with existing in-kernel routines


191907 07-May-2009 kmacy

don't call vn_rele_async_fini in the !_KERNEL case


191905 07-May-2009 kmacy

move VN_RELE_ASYNC to the compatibility layer with the rest of the VN_* defines


191903 07-May-2009 kmacy

avoid LOR and gratuitous extra lock acquisitions by moving user_evict list buffers to
a temporary list


191902 07-May-2009 kmacy

Allow the VM to provide backpressure on the ARC cache as it does
on Solaris.


191900 07-May-2009 kmacy

Asynchronously release vnodes to avoid blocking on range locks when calling back in to zfs.
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.

This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.

The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.

Reviewed by: pjd
MFC after: 7 days


190888 10-Apr-2009 rwatson

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


190878 10-Apr-2009 thompsa

Revert r190676,190677

The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by: scottl


190676 03-Apr-2009 thompsa

Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.


189967 18-Mar-2009 jhb

The zfs_get_xattrdir() function is used to find the extended attribute
directory for a znode. When the directory already exists, it returns a
referenced but unlocked vnode. When a directory does not yet exist, it
calls zfs_make_xattrdir() to create a new one. zfs_make_xattrdir() returns
the vnode both referenced and and locked and zfs_get_xattrdir() was leaking
this vnode lock to its callers. Fix this by dropping the vnode lock if
zfs_make_xattrdir() successfully creates a new extended attribute
directory.

Reviewed by: pjd


189696 11-Mar-2009 jhb

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


188588 13-Feb-2009 jhb

Use shared vnode locks when invoking VOP_READDIR().

MFC after: 1 month


187830 28-Jan-2009 ed

Last step of splitting up minor and unit numbers: remove minor().

Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.


185614 04-Dec-2008 imp

Put the MIPS support back in after it was removed in r185029.


185321 25-Nov-2008 trasz

MFp4: We don't support TX_CREATE_ACL_ATTR nor TX_MKDIR_ACL_ATTR; code found
in zfs_replay.c will panic if it encounters transactions of this type.
Make sure we don't put these into the ZIL.

Approved by: rwatson (mentor), pjd


185319 25-Nov-2008 pjd

Fix locking (file descriptor table and Giant around VFS).

Most submitted by: kib
Reviewed by: kib


185174 22-Nov-2008 pjd

IFp4: Don't rely on disk IDs and always use vdev guids, which means always look
up for components by reading metadata. This might be slower when there are big
number of disks in the system, but is definiately more reliable.


185172 22-Nov-2008 pjd

IFp4: Finish implemnetation of chflags(2) for ZFS. While doing this I found
that zfs_access() can only handle VREAD, VWRITE and VEXEC, for the rest we need
to use vaccess(9).


185171 22-Nov-2008 pjd

IFp4: Don't free pathname too soon, debugging code is still using it.


185029 17-Nov-2008 pjd

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


184770 08-Nov-2008 trasz

Require write access on a directory being moved from one parent
directory to another in ZFS.

Approved by: rwatson (mentor), pjd


184740 06-Nov-2008 trasz

Backoff the last patch. It was overly restrictive - we want to check
for write permission on target only when moving the target between two
directories.

Approved by: rwatson (mentor)


184737 06-Nov-2008 trasz

Change ZFS behaviour to match UFS: when moving (rename(2)) a subdirectory
from one parent directory to another, in addition to the usual access checks
one also needs write access to the subdirectory being moved.

Approved by: rwatson (mentor), pjd


184698 05-Nov-2008 rodrigc

Merge latest DTrace changes from Perforce.


184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


183754 10-Oct-2008 attilio

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


183417 27-Sep-2008 jb

Disable use of the user credentials until there is code to set the levels
that DTrace uses.

This fixes a bug that would have affected kernels built with MAC and all
kernels built after the mpsafetty integration.

The bug will be apparent in RELENG7 on MAC kernels.

Reported by: kan


183397 27-Sep-2008 ed

Replace all calls to minor() with dev2unit().

After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.

This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.

Reviewed by: kib


183037 15-Sep-2008 pjd

Add missing ZFS_EXIT().

PR: kern/124899
Submitted by: Masakazu Asama <m-asama@ginzado.ne.jp>


182905 10-Sep-2008 trasz

Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by: rwatson (mentor)


182840 07-Sep-2008 pjd

Initialize vp, so we don't call VOP_UNLOCK() with NULL vnode pointer.

Confirmed by: marcus


182824 06-Sep-2008 pjd

Lock vnode exclusively around insmntque().


182781 05-Sep-2008 pjd

Catch up after last insmntque() changes:
- The vnode has to be locked exclusively before calling insmntque().
- Until I find a way to handle insmntque() failures use VV_FORCEINSMQ flag
to force insmntque() to always succeed.

Reported by: kris, trasz, des, others
Suggested by: kib
Tested by: trasz


182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


182031 23-Aug-2008 imp

Add MIPS support.

Reviewed by: jb@


181879 19-Aug-2008 jb

Add calls to callout_drain() to ensure the callouts are flushed before
we free memory from underneath them.

This fixes an occasional panic I've been seeing in softclock() where a bad
pointer would be encountered when pushing DTrace hard.


180660 21-Jul-2008 pjd

We want to use LBOLT instead of lbolt on FreeBSD.
I've this already fixed in p4, but the fix was never integrated into HEAD.

Reported by: ed


179758 12-Jun-2008 ed

Remove the $FreeBSD$ tag again, now I know fbsd:nokeywords exists.

Requested by: pjd
Approved by: philip (mentor)


179757 12-Jun-2008 ed

Turn dev2unit(), minor(), unit2minor() and minor2unit() into macro's.

Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macro's. The unit2minor() and
minor2unit() macro's are now no-ops.

The ZFS code als defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.

Approved by: philip (mentor), pjd


179726 11-Jun-2008 ed

Don't enforce unique device minor number policy anymore.

Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.

Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.

This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.

The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.

Approved by: philip (mentor)


179469 01-Jun-2008 jb

Merge a recent change from the OpenSolaris source tree.
(Don't ask for a vendor import of this yet, we're in the early days of svn)

Instead of using cyclic timers to call the state clean and deadman callbacks,
use a callout on FreeBSD to avoid the deadlock on FreeBSD due to trying to
send interprocessor interrupts with interrupts disabled.

Reported by: ps, jhb, peter, thompsa


179310 25-May-2008 pjd

Fix namespace collision after src/sys/sys/file.h:1.78.


179307 25-May-2008 jb

Comment out the code that breaks with invariants. This is stuff that is
still WIP along with the lockstat provider, so there is no harm leaving
it out for now.


179280 24-May-2008 jb

Make the zfs module depend on the opensolaris module in preparation for it
to shared stuff with the DTrace modules.


179264 23-May-2008 jb

Delete a couple of OpenSolaris headers which get in the way of our
implementation.


179198 22-May-2008 jb

FreeBSD changes to vendor source.


179194 22-May-2008 jb

This commit was generated by cvs2svn to compensate for changes in r179193,
which included commits to RCS files with non-trunk default branches.


178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


178127 11-Apr-2008 marius

- Fix the path encoded in the multiple inclusion protection.
- GCC uses 32-byte function alignment for UltraSPARC CPUs.
- Remove code duplication.

Approved by: core, pjd
MFC after: 2 weeks


177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


177230 15-Mar-2008 pjd

Fix mmap(2) on ZFS after some changes in VM subsystem.

Submitted by: alc
Reported by: kris (originally) and many others
Tested with: fsx
MFC after: 1 week


176559 25-Feb-2008 attilio

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


175633 24-Jan-2008 pjd

- Reduce how much ZFS caches by default. This is another change to mitigate
'kmem_map too small panics'.
- Print two warnings if there is not enough memory and not enough address
space.
- Improve comment.


175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


174049 28-Nov-2007 jb

* Check endianness the FreeBSD way.

* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
variable.


174048 28-Nov-2007 jb

Fix a prototype definition.


173419 07-Nov-2007 pjd

Warn if kmem_map size is set to less than 512MB. Previous warning was a bit
pointless, because default is set to something around 300MB and also
insufficient.

MFC after: 3 days


173374 05-Nov-2007 pjd

Remove unused header.

MFC after: 3 days


173373 05-Nov-2007 pjd

If setting a state to anything but open state, close access to vdev.
This fixes replacing drive in place, eg. zpool replace tank da1 da1.
Before it complained that device is already open.

MFC after: 1 week


173268 02-Nov-2007 lulf

- Add sysctl for sizeof(znode_t), which will be used by fstat(1).

Approved by: pjd (mentor)


173250 01-Nov-2007 pjd

Call zil_commit() (if ZIL is not disabled) after every non-read request
(BIO_WRITE and BIO_FLUSH) as it is done is Solaris. The difference is
that Solaris calls it only for sync requests, but we can't say in GEOM
is the request is sync or async, so we do it for every request.

MFC after: 1 week


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172645 14-Oct-2007 thompsa

ZFS_LOG adds a newline by itself.

Pointed out by: pjd


172624 14-Oct-2007 thompsa

Print the ZFS ereport to the console if vfs.zfs.debug is set to help diagnose
problems with zfs-on-root since devd isnt running yet.

Reviewed by: pjd


172443 04-Oct-2007 pjd

Fix lock leak leading to the 'System call <name> returning with 1 locks held'
panic.

Reported by: kris
Approved by: re (kensmith)


172135 10-Sep-2007 pjd

Reduce the limit of vnodes on i386 when ZFS is loaded to 3/4 of the original
value, so we don't run out of KVA. The default vnodes limit fits better for
UFS, but ZFS allocated more file system specific memory for a vnode than UFS.

Don't touch vnodes limit if we detect it was tuned by system administrator
and restore original value when ZFS is unloaded.

This isn't final fix, but before we implement something better, this will
help to stabilize ZFS under heavy load on i386.

Approved by: re (bmah)


172130 10-Sep-2007 pjd

After dfr@ vnode leak fix, we can allow ARC to consume more memory.

Tested by: kris
Approved by: re (bmah)


172030 01-Sep-2007 pjd

Use CTLFLAG_RDTUN for tunable sysctls.

Approved by: re (bmah)


171567 24-Jul-2007 pjd

Update assertion after revision 1.23.

Reviewed by: dfr
Approved by: re (rwatson)


171316 09-Jul-2007 dfr

Correct a reference-counting mistake in the ZFS code which led to abnormal
memory usage and pessimal cache performance.

Reviewed by: pjd
Approved by: re (rwatson)


171063 27-Jun-2007 dfr

In zfs_vget, if we fail to translate an inode number to the corresponding
vnode, make sure we return an error code to the caller.

Reviewed by: pjd
Approved by: re


170431 08-Jun-2007 pjd

- Reduce number of atomic operations needed to be implemented in asm by
implementing some of them using existing ones.
- Allow to compile ZFS on all archs and use atomic operations surrounded
by global mutex on archs we don't have or can't have all atomic
operations needed by ZFS.


170281 04-Jun-2007 pjd

Reimplement traverse() helper function:
1. Pass locking flags to VFS_ROOT().
2. Check v_mountedhere while the vnode is locked.
3. Always return locked vnode on success.

Change 1 fixes problem reported by Stephen M. Rumble - after
zfs_vfsops.c,1.9 change, zfs_root() no longer locks the vnode
unconditionally and traverse() didn't pass right lock type to
VFS_ROOT(). The result was that kernel paniced when .zfs/ directory
was accessed via NFS.


170044 28-May-2007 pjd

Adjust va_mask for setattr. FreeBSD doesn't have va_mask, so we initialize it
based on individual fields beeing set. This doesn't work for setattr replay,
because va_type is set there, so we add AT_TYPE flag to va_mask, which won't
be accepted by zfs_setattr().

Reported by: kris


170040 28-May-2007 pjd

Because we allocate componentname structures on stack, bzero() them before
use just in case.


169929 24-May-2007 pjd

Initialize ZFS a bit earlier and block root mounting until
initialization is complete. This fixes some root-on-ZFS
configurations.

Reported by: Bruno Damour <freebsd.ruomad@free.fr>
Tested by: Bruno Damour <freebsd.ruomad@free.fr>


169920 23-May-2007 pjd

FreeBSD's namecache works quite well with ZFS, so remove DNLC.


169919 23-May-2007 pjd

All objects we create using GFS are directories, so initialize d_type
properly, but add XXX comment saying that it can eventually change in
the future.


169884 22-May-2007 pjd

Lock vnode on lookup. This fixes ZIL replay for rmdir/unlink/rename.

Reported by: des


169430 09-May-2007 pjd

Increase debug level - this message is not that important.


169325 06-May-2007 pjd

- Add missing lock destruction and remove duplicate initializations.
With this change it is possible to unload zfs.ko module from
WITNESS-enabled kernel.
- Remove bogus comment.


169303 06-May-2007 pjd

Use provider's ident to handle situations when disks are moved around
and show up with different names: first try to open provider using
remembered name and compare its ident, if equal, this is our provider,
if not equal or there is no provider with such name, find provider with
remembered ident and don't care about the name.


169302 06-May-2007 pjd

MFp4: We don't need to cover vnode_pager_setsize() with the z_map_lock.


169199 02-May-2007 pjd

Share-lock a vnode where possible.


169198 02-May-2007 pjd

When parent directory has to be unlocked, lock it back with the same lock
type. Before this change, if directory was shared-locked, it was relocked
exclusively.


169197 02-May-2007 pjd

Lock vnode using cn_lkflags in case the caller wants the vnode to be
shared-locked.


169196 02-May-2007 pjd

The getnewvnode() function sets LK_NOSHARE by default, so if we want to
support shared vnodes locking, we need to remove that flag.
Also add LK_CANRECURSE flag as found in nfsclient.


169195 02-May-2007 pjd

ZFS should update timestamps upon the creat() of an existing file.

Obtained from: OpenSolaris
Bug: http://bugs.opensolaris.org/view_bug.do?bug_id=6465105


169194 02-May-2007 pjd

- Lock vnode with flags passed in as argument in zfs_vget() and zfs_root().

Pointed out by: ups
Also reported by: kris

- Add comments where I'm not sure if LK_RETRY should be used.


169172 01-May-2007 pjd

MFp4: Remove LK_RETRY flag when locking vnode in zfs_lookup, we don't want
dead vnodes here.

Suggested by: kib


169170 01-May-2007 pjd

White space fixes.


169167 01-May-2007 pjd

Add a comment explaining why we call dmu_write() unconditionally, even if
uiomove() fails, especially that it is different from what OpenSolaris
does (I'm not entirely sure they are right).

Suggested by: darrenr


169108 29-Apr-2007 pjd

- Define d_type for ".", ".." and ".zfs" directories.
- Add a TODO comment where d_type is still noe defined.


169107 29-Apr-2007 pjd

Oops, correct important typo in last commit.


169106 29-Apr-2007 pjd

Avoid freeing NULL pointer in case of an error.


169087 29-Apr-2007 pjd

Fix two use-after-free cases.


169059 26-Apr-2007 pjd

MFp4: Optimize mappedwrite() and mappedread() functions to write/read as much
non-mapped data as possible at once and not page-by-page. Which this change we
combain I/Os, but also saves many VM_OBJECT_UNLOCK()/VM_OBJECT_LOCK()
operations.

Simple 'fsx -l 33554432 -o 524288 -N 10000 /tank/fsx' test shows ~23%
performance increase.


169057 26-Apr-2007 pjd

- Always try to write one whole page at a time.
- vm_page_undirty() is enough (instead of vm_page_set_validclean()), but it has
to be called before we write the data in case someone makes page dirty after
our write, but before our vm_page_undirty() call.
- Always dmu_write, not matter if uiomove() succeeded, because it could
partially be ok and we would lose some changes.

All good ideas from: ups


169056 26-Apr-2007 pjd

MFV: Free znodes immediatelly, allowing the ARC to hold onto less memory.

Full description at: http://bugs.opensolaris.org/view_bug.do?bug_id=6543706


169055 26-Apr-2007 pjd

MFV: Functions name change.


169028 24-Apr-2007 pjd

ZIL (ZFS Intent Log) can be safely turned on and off at run time, because
it is only used when dataset is beeing mounted to decide if log should also
be opened.


169027 24-Apr-2007 pjd

MFp4: Now that ZFS can use FreeBSD's namecache, turn it off by default and
turn off DNLC, but don't remove DNLC yet just in case.


169025 24-Apr-2007 pjd

MFp4: Rearange the code so vobject is destroyed from reclaim() method like
in all other file system on FreeBSD (instead from inactive() method).

A nice side-effect of this change, except that it speedups file system
when mmaped file are often open/closed, is that it makes FreeBSD's
namecache work:)


169024 24-Apr-2007 pjd

MFp4: Once page is written successfully, we should clear the dirty bits.
This fixes slow operations on mmaped files, because without this fix,
pages were written to disk multiple times.

If one is looking for even greater speed up for such operation, he should
disable ZIL (by setting vfs.zfs.zil_disable to 1 in /boot/loader.conf).
Disabling ZIL makes fsx run ~9 times faster.


169023 24-Apr-2007 pjd

MFp4: Reduce diff against vendor.


169022 24-Apr-2007 pjd

MFp4: We have stronger 'lock already initialized' check now, so we can
reduce diff against the vendor by removing bzero of this mutex.


168987 23-Apr-2007 bmah

Mostly-cosmetic fixes in low-memory warning messages:

o Fix linewrap issues.

o Fix two typos (s/Recomended/Recommended/ and s/tunning/tuning/)

o Remove a couple of extra instances of the word "of".

o Update names of kmem_size variables.

Approved by: pjd


168978 23-Apr-2007 pjd

Too much diff reduction. 'cmd' has to be u_long.

Reported by: delphij


168962 23-Apr-2007 pjd

MFp4: Reduce diff against vendor code:
- Move FreeBSD-specific code to zfs_freebsd_*() functions in zfs_vnops.c
and keep original functions as similar to vendor's code as possible.
- Add various includes back, now that we have them.


168959 22-Apr-2007 pjd

Fix 'zpool status -v'. To get object number we should use ZFS_DIRENT_OBJ()
macro, as za_first_integer field also contains type. This should be fixed in
ZFS itself, but this bug is not visible on Solaris, because there, type is
not stored in za_first_integer. On the other hand it will be visible on
MacOS X.

Reported by: Barry Pederson <bp@barryp.org>


168958 22-Apr-2007 pjd

Fix st_rdev handling (implement it, actually).

Reported by: gj


168926 21-Apr-2007 pjd

MFp4:

@118370 Correct typo.

@118371 Integrate changes from vendor.

@118491 Show backtrace on unexpected code paths.

@118494 Integrate changes from vendor.

@118504 Fix sendfile(2). I had two ways of fixing it:
1. Fixing sendfile(2) itself to use VOP_GETPAGES() instead of
hacking around with vn_rdwr(UIO_NOCOPY), which was suggested
by ups.
2. Modify ZFS behaviour to handle this special case.

Although 1 is more correct, I've choosen 2, because hack from 1
have a side-effect of beeing faster - it reads ahead MAXBSIZE
bytes instead of reading page by page. This is not easy to implement
with VOP_GETPAGES(), at least not for me in this very moment.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>

@118525 Reorganize the code to reduce diff.

@118526 This code path is expected. It is simply when file is opened with
O_FSYNC flag.

Reported by: kris
Reported by: Michal Suszko <dry@dry.pl>


168839 18-Apr-2007 pjd

MFp4: We check for PRIV_VFS_MOUNT already in mount(2) syscall and we don't
want to do the check when snapshot is automatically mounted by an
unprivileged user doing lookup on a snapshot directory.


168826 17-Apr-2007 pjd

Simplify.


168821 17-Apr-2007 pjd

Ignore hostid check for root-on-ZFS configurations. Making hostid available
before the root is mounted is tricky and having it in /boot/ is not really
desire.

Reported by: Zephiris <zephiris@gmail.com>


168775 16-Apr-2007 pjd

Uncomment forgotten check. Without this check in-place, ZFS will panic on
unload instead of returning EBUSY. This check tells if there are mounted
ZFS file systems or not. We can't unload if there are mounted file systems.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>


168753 15-Apr-2007 pjd

MFp4: Start DNLC after desiredvnodes variable is initialized.
Before this change if zfs.ko was loaded by the loader, DNLC was
automatically disabled.

Reported by: Zephiris <zephiris@gmail.com>


168738 14-Apr-2007 pjd

Fix RAID-Z resilvering.

Obtained from: OpenSolaris


168724 14-Apr-2007 pjd

MFp4: Hmm, it seems to work now.


168715 14-Apr-2007 pjd

MFp4: Use max_ncpus, which is used in other places in the code.


168714 14-Apr-2007 pjd

MFp4: Add more debug, so we can see if zpool.cache was loaded or why it
wasn't loaded.


168713 14-Apr-2007 pjd

MFp4: Allow to tune vfs.zfs.debug from loader.conf.


168712 14-Apr-2007 pjd

MFp4: - Allow to tune number of spa_zio_* threads.
- Reduce default number of spa_zio_* threads to N*spa_zio_issue
plus N*spa_zio_intr threads per ZIO type, where N is the number
of CPUs.
- Put ZIO type number in thread's name.


168696 13-Apr-2007 pjd

Fix overflow, which was causing endless loops when 32bit machine had more
than 2GB of RAM. This was because our physmem is long and 'physmem*PAGESIZE'
can be negative for more than 2GB of memory.

Reported by: Andrey V. Elsukov <bu7cher@yandex.ru>

It is not yet tested by Andrey, so there can be other problems, but this
was definiately a bug, so I'm committing a fix now.


168683 13-Apr-2007 pjd

Fix vnodes starvation caused by DNLC (ZFS name cache):
- Tune number of namecache entires better (based on desiredvnodes).
- Handle vfs_lowvnodes event by releasing requested number of name cache
entries, but no less than 5%.

Reported by: simokawa


168676 12-Apr-2007 pjd

MFp4: Synchronize with vendor (mostly 'zfs rename -r').


168583 10-Apr-2007 pjd

MFp4: Allow to set zfs_recover via vfs.zfs.recover from /boot/loader.conf.


168582 10-Apr-2007 pjd

MFp4: Hide under '#ifdef _KERNEL' only what's really needed.


168566 10-Apr-2007 pjd

Try to stabilize ZFS with regard to memory consumption:
- Allow to shrink ARC down to 16MB (instead of 64MB).
- Set arc_max to 1/2 of kmem_map by default.
- Start freeing things earlier when low memory situation is detected.
- Serialize execution of arc_lowmem().

I decided to setup minimum ZFS memory requirements to 512MB of RAM and 256MB of
kmem_map size. If there is less RAM or kmem_map, a warning will be printed.
World is cruel, be no better. In other words: modern file system requires
modern hardware:)

From ZFS administration guide:

"Currently the minimum amount of memory recommended to install a Solaris
system is 512 Mbytes. However, for good ZFS performance, at least one
Gbyte or more of memory is recommended."


168565 10-Apr-2007 pjd

Reduce diff against vendor - we have now stronger check for "mutex already
initialized", so we can go back to kmem_alloc().


168559 09-Apr-2007 pjd

Remove unused #define.


168511 09-Apr-2007 pjd

We don't have to wait for the root file system to be mounted anymore, now that
kobj KPI supports operating on files loaded by the loader.


168510 09-Apr-2007 pjd

Drop the Giant lock before calling zfs_domount(), which is held when
mounting root file system.


168509 08-Apr-2007 pjd

Move zpool.cache from /etc/zfs/ to /boot/zfs/, so we can keep it on
dedicated /boot/ file system and use ZFS for the root file system.


168498 08-Apr-2007 pjd

MFp4: Synchronize with recent OpenSolaris changes.


168494 08-Apr-2007 pjd

- Use 'name=value' so it can be properly recognized by devd(8).
- Use only subclass as devd's type.


168488 08-Apr-2007 pjd

Take vnode pointer and hold it under znode lock, so we won't race with
zfs_reclaim(). This may or may not fix problem reported by kris, but it's
definiatelly better that way.


168482 07-Apr-2007 pjd

Move atomic.S files to directories that better fit OpenSolaris directory
layout.


168481 07-Apr-2007 pjd

Fix libzpool compilation.

Reported by: des


168478 07-Apr-2007 pjd

Limit the number of system taskq threads to the number of CPUs.
They are only used when there is a need for reducing namecache.

Observed by: kris, csjp


168474 07-Apr-2007 des

Fix some type mismatches.

Reviewed by: pjd@


168473 07-Apr-2007 pjd

Allow to tune maximum and minimum memory used by ARC.


168460 07-Apr-2007 pjd

Add missing mutex_init() which was causing assertion panic when on clone
destruction.

Reported by: kris


168404 06-Apr-2007 pjd

Please welcome ZFS - The last word in file systems.

ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.

I'd like to thank all SUN developers that created this great piece of software.

Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)