359554 |
02-Apr-2020 |
mav |
MFC r359112: MFOpenZFS: make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy.
The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk.
To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks.
External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865 openzfs/zfs@b8738257c2607c73c731ce8e0fd73282b266d6ef |
358313 |
25-Feb-2020 |
mav |
MFC r349381: Avoid extra taskq_dispatch() calls by DMU.
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do.
This change adds check for empty sublists to avoid this. |
353759 |
19-Oct-2019 |
avg |
MFC r353037: ZFS: add bookmark renaming |
347047 |
03-May-2019 |
mav |
MFC r346760: Fix minor mismerges.
No functional change. |
346686 |
25-Apr-2019 |
mav |
MFC r345200: MFV r336930: 9284 arc_reclaim_thread has 2 jobs
`arc_reclaim_thread()` calls `arc_adjust()` after calling `arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to indicate that we may no longer be `arc_is_overflowing()`.
The problem is, `arc_kmem_reap_now()` can take several seconds to complete, has no impact on `arc_is_overflowing()`, but due to how the code is structured, can impact how long the ARC will remain in the `arc_is_overflowing()` state.
The fix is to use seperate threads to:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`.
illumos/illumos-gate@de753e34f9c399037936e8bc547d823bba9d4b0d
Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Tim Kordas <tim.kordas@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Brad Lewis <brad.lewis@delphix.com> |
346676 |
25-Apr-2019 |
mav |
MFC r337594 (by mmacy): ZFS/MFV: Use cached feature info in spa_add_feature_stats()
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780 Author: Ned Bass <bass6@llnl.gov> Date: Thu Feb 26 12:24:11 2015 -0800
Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information. Trying to read the ZAPs from disk means that zpool clear would hang if the pool is suspended and recovery would require a reboot. To keep the feature stats resident in memory, we hang a cached nvlist off of the spa. It is built up from disk the first time spa_add_feature_stats() is called, and refreshed thereafter using the cached feature reference counts. spa_add_feature_stats() gets called at pool import time so we can be sure the cached nvlist will be available if the pool is later suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3082 |
346131 |
11-Apr-2019 |
mav |
MFC r344936: MFV/ZoL: Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks have twice the bandwidth of the inner tracks. However, it is detrimental on nonrotational media such as solid state disks, where the only effect is to ensure that metaslabs enter the best-fit allocation behavior sooner, which is detrimental to performance. It also makes no sense on files where the underlying filesystem can arrange things however it wants.
Author: Richard Yao <ryao@gentoo.org> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3712 zfsonlinux/zfs@fb40095f5f0853946f8150481ca22602d1334dfe
To reduce code divergence this merge replaces equivalent but different FreeBSD code detecting non-rotating medium vdevs. |
339158 |
03-Oct-2018 |
mav |
MFC r337567 (by mmacy): Performance optimization of AVL tree comparator functions
MFV: commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f Author: Gvozden Neskovic <neskovic@gmail.com> Date: Sat Aug 27 20:12:53 2016 +0200
perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3:
old 6.85789 s new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals
perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level.
perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033 |
339147 |
03-Oct-2018 |
mav |
MFC r337229: Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the logical zio_read(), and then a physical zio. Currently, each of these results in a separate taskq_dispatch(zio_execute).
On high-read-iops workloads, this causes a significant performance impact. By processing all 3 ZIO's in a single taskq entry, we reduce the overhead on taskq locking and context switching. We accomplish this by allowing zio_done() to return a "next zio to execute" to zio_execute().
This results in a ~12% performance increase for random reads, from 96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-59292 Closes #7736
zfsonlinux/zfs@62840030a7dceaee013ddbcc1eebcfc7922edf7c |
339141 |
03-Oct-2018 |
mav |
MFC r337213: MFV r337212: 9465 ARC check for 'anon_size > arc_c/2' can stall the system
illumos/illumos-gate@abe1fd01ce5a83718c5a840daeab4abdaec1c104
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Don Brady <don.brady@delphix.com> |
339140 |
03-Oct-2018 |
mav |
MFC r337211: MFV r337210: 9577 remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can instead pass a flag down in a few places to prevent recursive dbuf eviction. Making this change has 3 benefits:
1. The code semantics are easier to understand. 2. On Linux, performance is improved, because creating/removing TSD values (by setting to NULL vs non-NULL) is expensive, and we do it very often. 3. According to Nexenta, the current semantics can cause a deadlock when concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they are working on a "parallel unmount" change that triggers this more easily)
illumos/illumos-gate@c2919acbea007fa95c709b60d073db9a24526e01
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> |
339131 |
03-Oct-2018 |
mav |
MFC r337191: MFV r337190: 9486 reduce memory used by device removal on fragmented pools
In the most fragmented real-world cases, this reduces memory used by the mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less fragmented cases will typically also see around 50-100MB of RAM per 1TB of storage.
illumos/illumos-gate@cfd63e1b1bcf7ba4bf72f55ddbd87ce008d2986d
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
339128 |
03-Oct-2018 |
mav |
MFC r337181: 9539 Make zvol operations use _by_dnode routines
Continues what was started in 7801 add more by-dnode routines by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols.
illumos/illumos-gate@8dfe5547fbf0979fc1065a8b6fddc1e940a7cf4f
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Rick McNeal <rick.mcneal@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Richard Yao <richard.yao@prophetstor.com> |
339126 |
03-Oct-2018 |
mav |
MFC r337177: MFV r337175: 9487 Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to exist after receiving the stream. We should free them accordingly, as if a freeobjects record for them had been included in the stream.
zfsonlinux/zfs@48fbb9ddbf2281911560dfbc2821aa8b74127315 illumos/illumos-gate@7864b8192b8d30471fa2240466d516292e5765b8
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> |
339125 |
03-Oct-2018 |
mav |
MFC r337172, MFV r337171: 9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes
Ideally we would like txg_kick() to get triggered only when we are sure that we are not syncing AND not quiescing any txg. This way we can kick an open TXG to the quiescing state when we are sure that there is nothing going on and we would benefit from the different states running concurrently.
illumos/illumos-gate@fa41d87de9ec9000964c605eb01d6dc19e4a1abe
Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> |
339120 |
03-Oct-2018 |
mav |
MFC r337169: MFV r337167: 9442 decrease indirect block size of spacemaps
Updates to indirect blocks of spacemaps can contribute significantly to write inflation. Therefore we want to reduce the indirect block size of spacemaps from 128K to 16K.
illumos/illumos-gate@221813c13b43ef48330b03725e00edee85108cf1
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Albert Lee <trisk@forkgnu.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
339111 |
03-Oct-2018 |
mav |
MFC r337007: MFV r336991, r337001: 9102 zfs should be able to initialize storage devices
The first access to a disk block can incur a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that volumes be "thick provisioned", where supported by the platform (VMware). Thick provisioning is time consuming and often is ignored. If the thick provision step is omitted, customers will see suboptimal performance until we have written to all parts of the LUN. ZFS should be able to initialize any unused storage to remove any first-write penalty that exists.
illumos/illumos-gate@094e47e980b0796b94b1b8f51f462a64d246e516
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com> |
339109 |
03-Oct-2018 |
mav |
MFC r336959: MFV r336958: 9337 zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much).
illumos/illumos-gate@adb52d9262f45a04318fc6e188fe2b7f59d989a5
Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> |
339108 |
03-Oct-2018 |
mav |
MFC r336956: MFV r336955: 9236 nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.
illumos/illumos-gate@21f7c81cc1156e9202ce3412d3ecaa697c3b2222
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> |
339106 |
03-Oct-2018 |
mav |
MFC r336951: MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy.
illumos/illumos-gate@3a4b1be953ee5601bab748afa07c26ed4996cde6
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> |
339105 |
03-Oct-2018 |
mav |
MFC r336949: MFV r336948: 9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads.
illumos/illumos-gate@f78cdc34af236a6199dd9e21376f4a46348c0d56
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Paul Dagnelie <pcd@delphix.com> |
339104 |
03-Oct-2018 |
mav |
MFC r336947: MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev).
The new remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps.
One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t.
illumos/illumos-gate@17f11284b49b98353b5119463254074fd9bc0a28
Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> |
339034 |
01-Oct-2018 |
sef |
MFC r334844, r336180, r336458
r334844
This originated from ZFS On Linux, as https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd
During scans (scrubs or resilvers), it sorts the blocks in each transaction group by block offset; the result can be a significant improvement. (On my test system just now, which I put some effort to introduce fragmentation into the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the changes.) I've seen similar rations on production systems.
r336180
Fix up some missed and mis-merges from the sequential scan code (r334844). Most of the changes involve moving some code around to reduce conflicts with future merges. One of the missing changes included a notification on scrub cancellation.
r336458
Fix a couple of typos in r334844 noticed by Richard Kojedzinszky
Approved by: mav Sponsored by: iXsystems, Inc |
338975 |
27-Sep-2018 |
mav |
MFC r334810 (by benno), r338205, r338206: r334810: Break recursion involving getnewvnode and zfs_rmnode.
When we're at our vnode limit, getnewvnode will call into the vnode LRU cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling represents something with extended attributes, zfs_rmnode will call zfs_zget which will attempt to allocate another vnode. If the next vnode we try to recycle is also a ZFS vnode representing something with extended attributes we can recurse further. This ends up being unbounded and can end up overflowing the stack.
In order to avoid this, restructure zfs_rmnode to simply add the extended attribute directory's object ID to the unlinked set, thus not requiring the allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain which will do the work of properly marking the vnodes for unlinking. zfs_unlinked_drain is also called on mount so these will be cleaned up there.
r338205: Create separate taskqueue to call zfs_unlinked_drain().
r334810 introduced zfs_unlinked_drain() dispatch to taskqueue on every deletion of a file with extended attributes. Using system_taskq for that with its multiple threads in case of multiple files deletion caused all available CPU threads to uselessly spin on busy locks, completely blocking the system.
Use of single dedicated taskqueue is the only easy solution I've found, while in would be great if we could specify that some task should be executed only once at a time, but never in parallel, while many tasks could use different threads same time.
r338206: Add dmu_tx_assign() error handling in zfs_unlinked_drain().
The error handling got lost during r334810, while according to the report error there may happen in case of dataset being over quota. In such case just leave the node in the unlinked list to be freed sometimes later. |
338484 |
05-Sep-2018 |
kib |
MFC r338370: Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines. |
332785 |
19-Apr-2018 |
mav |
MFC r332523: 9433 Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1 the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely.
This would have be fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> |
332547 |
16-Apr-2018 |
mav |
MFC r331701: MFV r331695, 331700: 9166 zfs storage pool checkpoint
illumos/illumos-gate@8671400134a11c848244896ca51a7db4d0f69da4
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken and the user can revert back to it later or discard it. Its generic use case is an administrator that is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> |
332543 |
16-Apr-2018 |
mav |
MFC r331414: Reduce struct aggsum_bucket padding to fit into one cache line. |
332540 |
16-Apr-2018 |
mav |
MFC r331404: MFV r331400: 8484 Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should implements fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> |
332537 |
16-Apr-2018 |
mav |
MFC r329802: MFV r329799, r329800: 9079 race condition in starting and ending condesing thread for indirect vdevs
illumos/illumos-gate@667ec66f1b4f491d5e839644e0912cad1c9e7122
The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning).
The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export.
Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim@delphix.com> |
332536 |
16-Apr-2018 |
mav |
MFC r329798: MFV r329793, r329795: 9075 Improve ZFS pool import/load process and corrupted pool recovery
illumos/illumos-gate@6f7938128a2c5e23f4b970ea101137eadd1470a1
Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes:
https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss Of data, this feature is for now restricted to a read-only import.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com> |
332530 |
16-Apr-2018 |
mav |
MFC r329765: MFV r329762: 8961 SPA load/import should tell us why it failed
illumos/illumos-gate@3ee8c80c747c4aa3f83351a6920f30c411236e1b
When we fail to open or import a storage pool, we typically don't get any additional diagnostic information, just "no pool found" or "can not import".
While there may be no additional user-consumable information, we should at least make this situation easier to debug/diagnose for developers and support. For example, we could start by using `zfs_dbgmsg()` to log each thing that we try when importing, and which things failed. E.g. "tried uberblock of txg X from label Y of device Z". Also, we could log each of the stages that we go through in `spa_load_impl()`.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> |
332525 |
16-Apr-2018 |
mav |
MFC r329732: MFV r329502: 7614 zfs device evacuation/removal
illumos/illumos-gate@5cabbc6b49070407fb9610cfe73d4c0e0dea3e77
https://www.illumos.org/issues/7614: This project allows top-level vdevs to be removed from the storage pool with “zpool remove”, reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now “indirect”) vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries become “obsolete” because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been “remapped” in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be “remapped” to their new (concrete) locations if possible. This process can be accelerated by using the “zfs remap” command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. Therefore, mirror and raidz devices can not be removed.
Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Prashanth Sreenivasa <pks@delphix.com> |
332524 |
16-Apr-2018 |
mav |
MFC r307317: MFV r307313: 5120 zfs should allow large block/gzip/raidz boot pool (loader project)
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Toomas Soome <tsoome@me.com>
openzfs/openzfs@c8811bd3e2427dddbac6c05a59cfe117d8fea370
FreeBSD still does not support booting from gzip-compressed datasets, so keep one chunk of this commit out. |
331611 |
27-Mar-2018 |
avg |
MFC r330974: MFV r330973: 9164 assert: newds == os->os_dsl_dataset
PR: 225877 |
331399 |
22-Mar-2018 |
mav |
MFC r329694: MFV r324198: 8081 Compiler warnings in zdb
illumos/illumos-gate@3f7978d02b206a6ebc5652c91aa9f42da6fbe00c https://github.com/illumos/illumos-gate/commit/3f7978d02b206a6ebc5652c91aa9f42da6fbe00c
https://www.illumos.org/issues/8081 zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD, which uses -WError, the only way to build it is to disable all compiler warnings. This makes it much harder to detect newly introduced bugs. We should cleanup all the warnings.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Alan Somers <asomers@gmail.com> |
331397 |
23-Mar-2018 |
mav |
MFC r329690: MFV r319737: 6939 add sysevents to zfs core for commands
illumos/illumos-gate@ce1577b04976f1d8bb5f235b6eaaab15b46a3068 https://github.com/illumos/illumos-gate/commit/ce1577b04976f1d8bb5f235b6eaaab15b46a3068
https://www.illumos.org/issues/6939 Originally created https://smartos.org/bugview/OS-4489 sysevents should be fired in the kernel from ZFS whenever a command is run that is logged in zpool history. Example output Terminal 1 root - gz sunos ~ # zfs create zones/foobar root - gz sunos ~ # zfs set quota=10g zones/foobar root - gz sunos ~ # zfs destroy zones/foobar Terminal 2 root - gz sunos ~ # sysevent EC_zfs nvlist version: 0 date = 2016-04-28T14:50:08.964Z vendor = SUNW publisher = zfs class = EC_zfs subclass = ESC_ZFS_history_event pid = 0 data = (embedded nvlist) nvlist version: 0 pool_name = zones pool_guid = 0x40c964e8f9a7a694 history_record = (embedded nvlist) nvlist version: 0 dsname = zones/foobar dsid = 0x1525 history internal str = internal_name = create history txg = 0x4c4ef3
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Joshua M. Clulow <jmc@joyent.com> Reviewed by: Josh Wilsdon <jwilsdon@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Alan Somers <asomers@gmail.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dave Eddy <dave@daveeddy.com> |
331395 |
22-Mar-2018 |
mav |
MFC r329681: MFV r318941: 7446 zpool create should support efi system partition
illumos/illumos-gate@7855d95b30fd903e3918bad5a29b777e765db821 https://github.com/illumos/illumos-gate/commit/7855d95b30fd903e3918bad5a29b777e765db821
https://www.illumos.org/issues/7446 Since we support whole-disk configuration for boot pool, we also will need whole disk support with UEFI boot and for this, zpool create should create efi- system partition. I have borrowed the idea from oracle solaris, and introducing zpool create - B switch to provide an way to specify that boot partition should be created. However, there is still an question, how big should the system partition be. For time being, I have set default size 256MB (thats minimum size for FAT32 with 4k blocks). To support custom size, the set on creation "bootsize" property is created and so the custom size can be set as: zpool create B - o bootsize=34MB rpool c0t0d0 After pool is created, the "bootsize" property is read only. When -B switch is not used, the bootsize defaults to 0 and is shown in zpool get output with value ''. Older zfs/zpool implementations are ignoring this property. https://www.illumos.org/rb/r/219/
Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Dan McDonald <danmcd@kebe.com> Author: Toomas Soome <tsoome@me.com>
This commit makes no sense for FreeBSD, that is why I blocked the option, but it should be good to stay closer to upstream. |
331384 |
22-Mar-2018 |
mav |
MFC r329625: MFV r307315: 7301 zpool export -f should be able to interrupt file freeing
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Author: Alek Pinchuk <alek@nexenta.com>
Closes #175 |
330991 |
15-Mar-2018 |
avg |
MFC r329363: read-behind / read-ahead support for zfs_getpages() |
330990 |
15-Mar-2018 |
avg |
MFC r329823: another rework of getzfsvfs / getzfsvfs_impl code |
330986 |
15-Mar-2018 |
avg |
MFC r329717: MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue |
330237 |
01-Mar-2018 |
avg |
MFC r329314: MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio
PR: 223803 |
329486 |
18-Feb-2018 |
mav |
MFC r328228: MFV r328227: 8909 8585 can cause a use-after-free kernel panic
illumos/illumos-gate@94ddd0900a8838f62bba15e270649a42f4ef9f81
https://www.illumos.org/issues/8909: There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo dule "zfs" due to a NULL pointer dereference dump content: kernel pages only
> $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic.
This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> |
329485 |
18-Feb-2018 |
mav |
MFC r328226: MFV r328225: 8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock"
illumos/illumos-gate@cf07d3da9915c0d22da8f59e991639f819463cef
https://www.illumos.org/issues/8603: To help make the ZIL's code more understandable, it was suggested that the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> |
329484 |
18-Feb-2018 |
mav |
MFC r328224: MFV r328220: 8677 Open-Context Channel Programs
illumos/illumos-gate@a3b2868063897ff0083dea538f55f9873eec981f
https://www.illumos.org/issues/8677 We want to be able to run channel programs outside of synching context. This would greatly improve performance of channel program that just gather information, as we won't have to wait for synching context anymore.
This feature should introduce the following: - A new command line flag in "zfs program" to specify our intention to run in open context. - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> |
325931 |
17-Nov-2017 |
avg |
MFC r325610: MFV r325609: 7531 Assign correct flags to prefetched buffers |
325541 |
08-Nov-2017 |
avg |
MFC r324195: MFV r323795: 8604 Avoid unnecessary work search in VFS when unmounting snapshots |
325538 |
08-Nov-2017 |
avg |
MFC r324197: MFV r323913: 8600 ZFS channel programs - snapshot |
325537 |
08-Nov-2017 |
avg |
MFC r324196: MFV r323912: 8592 ZFS channel programs - rollback |
325534 |
08-Nov-2017 |
avg |
MFC r324163: MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups
Also MFC-ed are the following fixes: - r324164 Add several new files to the files enabled by ZFS kernel option - r324178 unbreak kernel builds on sparc64 and powerpc - r324194 fix incorrect use of getzfsvfs_impl in r324163 - r324292 really unbreak kernel builds on sparc64 and powerpc64 |
325153 |
30-Oct-2017 |
avg |
MFC r324349: MFV r322235: 8067 zdb should be able to dump literal embedded block pointer |
325132 |
30-Oct-2017 |
avg |
MFC r324011, r324016: MFV r323535: 8585 improve batching done in zil_commit()
FreeBSD notes: - this MFV reverts FreeBSD commit r314549 to make the merge easier - at present our emulation of cv_timedwait_hires is rather poor, so I elected to use cv_timedwait_sbt directly Please see the differential revision for details. Unfortunately, I did not get any positive reviews, so there could be bugs in the FreeBSD-specific piece of the merge. Hence, the long MFC timeout.
illumos/illumos-gate@1271e4b10dfaaed576c08a812f466f6e81370e5e https://github.com/illumos/illumos-gate/commit/1271e4b10dfaaed576c08a812f466f6e81370e5e
https://www.illumos.org/issues/8585 The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently oustanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning. The size of any given batch is proportional to the number of ZIL transaction in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch).
Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> |
324205 |
02-Oct-2017 |
avg |
MFC r323433,r323793,r323915: MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems, and followups
r323433: MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems r323793: MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure r323915: MFV r323914: 8661 remove "zil-cw2" dtrace probe |
324010 |
26-Sep-2017 |
avg |
MFC r323355: MFV r323107: 8414 Implemented zpool scrub pause/resume
illumos/illumos-gate@1702cce751c5cb7ead878d0205a6c90b027e3de8 https://github.com/illumos/illumos-gate/commit/1702cce751c5cb7ead878d0205a6c90b027e3de8
FreeBSD note: rather than merging the zpool.8 update I copied the zpool scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost verbatim. Now that the illumos page uses the mdoc format, it was an easier option. Perhaps the change is not in perfect compliance with the FreeBSD style, but I think that it is acceptible.
https://www.illumos.org/issues/8414 This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167 Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth.
Description
This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused.
Motivation and Context
Scrub pausing can be an I/O intensive operation and people have been asking for the ability to pause a scrub for a while. This allows one to preserve scrub progress while freeing up bandwidth for other I/O.
Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Alek Pinchuk <apinchuk@datto.com> |
324004 |
26-Sep-2017 |
avg |
MFC r323479,r323491: zfs: tighten debug versions of ZTOV and VTOZ
Sponsored by: Panzura |
323758 |
19-Sep-2017 |
avg |
MFC r322241: MFV r322240: 8491 uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS
illumos/illumos-gate@79c2b812ee2010ebf20fdd92dc5f06b59000a94c https://github.com/illumos/illumos-gate/commit/79c2b812ee2010ebf20fdd92dc5f06b59000a94c
https://www.illumos.org/issues/8491 The zpool checkpoint feature in DxOS added a new field in the uberblock. The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279). As these two changes come from two different sources and once upstreamed and deployed will introduce an incompatibility with each other we want to upstream a change that will reserve the padding for both of them so integration goes smoothly and everyone gets both features.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Olaf Faaland <faaland1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> |
323757 |
19-Sep-2017 |
avg |
MFC r322230: MFV r322229: 7600 zfs rollback should pass target snapshot to kernel
illumos/illumos-gate@77b171372ed21642e04c873ef1e87fe2365520df https://github.com/illumos/illumos-gate/commit/77b171372ed21642e04c873ef1e87fe2365520df
https://www.illumos.org/issues/7600 At present, the kernel side code seems to blindly rollback to whatever happens to be the latest snapshot at the time when the rollback task is processed. The expected target's name should be passed to the kernel driver and the sync task should validate that the target exists and that it is the latest snapshot indeed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> |
323749 |
19-Sep-2017 |
avg |
MFC r322233: MFV r322232: 8426 mark immutable buffer arguments as such in abd.h
illumos/illumos-gate@9b195260e22529ac0e2580faaf89402420589c1c https://github.com/illumos/illumos-gate/commit/9b195260e22529ac0e2580faaf89402420589c1c
https://www.illumos.org/issues/8426 abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so qualify them with const. abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the corresponding arguments.
Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> |
321614 |
27-Jul-2017 |
mav |
MFC r320352: zfs: port vdev_file part of illumos change 3306
3306 zdb should be able to issue reads in parallel illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306
The upstream change was made before we started to import upstream commits individually. It was imported into the illumos vendor area as r242733. That commit was MFV-ed in r260138, but as the commit message says vdev_file.c was left intact.
This commit actually implements the parallel I/O for vdev_file using a taskqueue with multiple thread. This implementation does not depend on the illumos or FreeBSD bio interface at all, but uses zio_t to pass around all the relevent data. So, the code looks a bit different from the upstream.
This commit also incorporates ZoL commit zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed https://github.com/zfsonlinux/zfs/issues/2270 We need to use a dedicated taskqueue for exactly the same reason as ZoL as we do not implement TASKQ_DYNAMIC.
Obtained from: illumos, ZFS on Linux |
321611 |
27-Jul-2017 |
mav |
MFC r320237: MFV r318947: 7578 Fix/improve some aspects of ZIL writing.
FreeBSD note: this commit removes small differences between what mav committed to FreeBSD in r308782 and what ended up committed to illumos after addressing all review comments.
illumos/illumos-gate@c5ee46810f82e8a53d2cc5a487568a573f449039 https://github.com/illumos/illumos-gate/commit/c5ee46810f82e8a53d2cc5a487568a573f449039
https://www.illumos.org/issues/7578 After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z*_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. While there, untangle and unify code of z*_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alexander Motin <mav@FreeBSD.org> |
321610 |
27-Jul-2017 |
mav |
MFC r320156, r320185, r320186, r320262, r320452, r321111: MFV r318946: 8021 ARC buf data scatter-ization
illumos/illumos-gate@770499e185d15678ccb0be57ebc626ad18d93383 https://github.com/illumos/illumos-gate/commit/770499e185d15678ccb0be57ebc626ad1 8d93383
https://www.illumos.org/issues/8021 The ARC buf data project (known simply as "ABD" since its genesis in the ZoL community) changes the way the ARC allocates `b_pdata` memory from using linea r `void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This improves ZFS's performance by helping to defragment the address space occupied by the ARC, in particular for cases where compressed ARC is enabled. It could also ease future work to allocate pages directly from `segkpm` for minimal- overhead memory allocations, bypassing the `kmem` subsystem. This is essentially the same change as the one which recently landed in ZFS on Linux, although they made some platform-specific changes while adapting this work to their codebase: 1. Implemented the equivalent of the `segkpm` suggestion for future work mentioned above to bypass issues that they've had with the Linux kernel memory allocator. 2. Changed the internal representation of the ABD's scatter/gather list so it could be used to pass I/O directly into Linux block device drivers. (This feature is not available in the illumos block device interface yet.)
FreeBSD notes: - the actual (default) chunk size is 4KB (despite the text above saying 1KB) - we can try to reimplement ABDs, so that they are not permanently mapped into the KVA unless explicitly requested, especially on platforms with scarce KVA - we can try to use unmapped I/O and avoid intermediate allocation of a linear, virtual memory mapped buffer - we can try to avoid extra data copying by referring to chunks / pages in the original ABD
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Dan Kimmel <dan.kimmel@delphix.com> |
321609 |
27-Jul-2017 |
mav |
MFC r320153: revert r315852 which introduced zio_buf_alloc_nowait for use in vdev_queue_aggregate
I think that the change is still good, but reconciling it with a planned merge of the ARC buf data scatter-ization is a bit more tedious than I can handle. |
321578 |
26-Jul-2017 |
mav |
MFC r319949: MFV r319948: 5428 provide fts(), reallocarray(), and strtonum()
illumos/illumos-gate@4585130b259133a26efae68275dbe56b08366deb https://github.com/illumos/illumos-gate/commit/4585130b259133a26efae68275dbe56b08366deb
https://www.illumos.org/issues/5428
Most of the upstream change is not applicable to FreeBSD. Only the renaming of strtonum to zfs_strtonum is relevant to us. And we already had it partially done.
Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> |
321573 |
26-Jul-2017 |
mav |
MFC r319748: MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers
illumos/illumos-gate@adaec86ad212d9fd756bee322934fa54d1258605 https://github.com/illumos/illumos-gate/commit/adaec86ad212d9fd756bee322934fa54d1258605
https://www.illumos.org/issues/8155 When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We can simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321570 |
26-Jul-2017 |
mav |
MFC r318945: MFV r318944: 8265 Reserve send stream flag for large dnode feature
illumos/illumos-gate@bc83969fdbd1cb0d97ba00218c0a3de5c89fba92 https://github.com/illumos/illumos-gate/commit/bc83969fdbd1cb0d97ba00218c0a3de5c 89fba92
https://www.illumos.org/issues/8265 Reserve bit 23 in the zfs send stream flags for the large dnode feature which has been implemented for Linux.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Brian Behlendorf <behlendorf1@llnl.gov> |
321567 |
26-Jul-2017 |
mav |
MFC r318932: MFV r318931: 8063 verify that we do not attempt to access inactive txg
illumos/illumos-gate@b7b2590dd9f11b12a0b4878db3886068cce176af https://github.com/illumos/illumos-gate/commit/b7b2590dd9f11b12a0b4878db3886068cce176af
https://www.illumos.org/issues/8063 A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321554 |
26-Jul-2017 |
mav |
MFC r318829: MFV r316920: 8023 Panic destroying a metaslab deferred range tree
illumos/illumos-gate@3991b535a8e990c0369be677746a87c259b13e9f https://github.com/illumos/illumos-gate/commit/3991b535a8e990c0369be677746a87c259b13e9f
https://www.illumos.org/issues/8023 $C ffffff0011bc0970 vpanic() ffffff0011bc0a00 strlog() ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00) ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380) ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800) ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000) ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0) ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000) ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000) ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190) ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()
Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com> |
321553 |
26-Jul-2017 |
mav |
MFC r318828: MFV r316917: 7968 multi-threaded spa_sync()
illumos/illumos-gate@94c2d0eb22e9624151ee84a7edbf7178e1bf4087 https://github.com/illumos/illumos-gate/commit/94c2d0eb22e9624151ee84a7edbf7178e1bf4087
https://www.illumos.org/issues/7968 spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. On the concurrent file creation microbenchmark, the performance improvement from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/ reward, once the other bottlenecks are addressed, fixing this bug will provide a medium-large performance gain and require a medium amount of effort to implement.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321552 |
26-Jul-2017 |
mav |
MFC r318827: MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists
illumos/illumos-gate@10fbdecb05f411234920f8d3c92c148d39106d7e https://github.com/illumos/illumos-gate/commit/10fbdecb05f411234920f8d3c92c148d39106d7e
https://www.illumos.org/issues/7970 The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321549 |
26-Jul-2017 |
mav |
MFC r318823: MFC r316914: 7801 add more by-dnode routines
illumos/illumos-gate@b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe https://github.com/illumos/illumos-gate/commit/b0c42cd4706ba01ce158bd2bb1004f7e59eca5fe
https://www.illumos.org/issues/7801 Add *_by_dnode() routines for accessing objects given their dnode_t *, this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: https://github.com/zfsonlinux/zfs/commit/0eef1bde31d67091d3deed23fe2394f5a8bf2276
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: bzzz77 <bzzz.tomas@gmail.com> |
321547 |
26-Jul-2017 |
mav |
MFC r318821: MFV r316912: 7793 ztest fails assertion in dmu_tx_willuse_space
illumos/illumos-gate@61e255ce7267b52208af9daf434b77d37fb75622 https://github.com/illumos/illumos-gate/commit/61e255ce7267b52208af9daf434b77d37 fb75622
https://www.illumos.org/issues/7793 Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_*() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_* accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_*()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_*(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321545 |
26-Jul-2017 |
mav |
MFC r318818: MFV r316907: 1300 filename normalization doesn't work for removes
illumos/illumos-gate@1c17160ac558f98048951327f4e9248d8f46acc0 https://github.com/illumos/illumos-gate/commit/1c17160ac558f98048951327f4e9248d8f46acc0
https://www.illumos.org/issues/1300
FreeBSD note: recent FreeBSD was not affected by the issue fixed as the name cache is completely bypassed when normalization is enabled. The change is imported for the sake of ZAP infrastructure modifications.
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> |
321541 |
26-Jul-2017 |
mav |
MFC r317541: MFV 316905
7740 fix for 6513 only works in hole punching case, not truncation
illumos/illumos-gate@7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553 https://github.com/illumos/illumos-gate/commit/7de35a3ed0c2e6d4256bd2fb05b48b3798aaf553
https://www.illumos.org/issues/7740 The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. To verify, create a large file, truncate it to a short length, and then write beyond the end. Check with zdb to make sure that there are no holes with birth time zero (will appear as gaps).
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com> |
321540 |
26-Jul-2017 |
mav |
MFC r317533: MFV 316900
7743 per-vdev-zaps have no initialize path on upgrade
illumos/illumos-gate@555da5111b0f2552c42d057b211aba89c9c79f6c https://github.com/illumos/illumos-gate/commit/555da5111b0f2552c42d057b211aba89c9c79f6c
https://www.illumos.org/issues/7743 When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs.
Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> |
321539 |
26-Jul-2017 |
mav |
MFC r317527: MFV 316898
7613 ms_freetree[4] is only used in syncing context
illumos/illumos-gate@5f145778012b555e084eacc858ead9e1e42bd149 https://github.com/illumos/illumos-gate/commit/5f145778012b555e084eacc858ead9e1e42bd149
https://www.illumos.org/issues/7613 metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg).
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> |
321538 |
26-Jul-2017 |
mav |
MFC r317522: MFV 316897
7586 remove #ifdef __lint hack from dmu.h
illumos/illumos-gate@4ba5b9616327ef64e8abc737d29b3faabc6ae68c https://github.com/illumos/illumos-gate/commit/4ba5b9616327ef64e8abc737d29b3faabc6ae68c
https://www.illumos.org/issues/7586 The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if we add other inline functions into header files in ZFS, especially since it is difficult to make any other solution work across all compilation targets. We should switch to disabling the lint flags that are failing instead.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com> |
321535 |
26-Jul-2017 |
mav |
MFC r317414: MFV 316894
7252 7628 compressed zfs send / receive
illumos/illumos-gate@5602294fda888d923d57a78bafdaf48ae6223dea https://github.com/illumos/illumos-gate/commit/5602294fda888d923d57a78bafdaf48ae6223dea
https://www.illumos.org/issues/7252 This feature includes code to allow a system with compressed ARC enabled to send data in its compressed form straight out of the ARC, and receive data in its compressed form directly into the ARC.
https://www.illumos.org/issues/7628 We should have longer, more readable versions of the ZFS send / recv options.
7628 create long versions of ZFS send / receive options
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com> |
321531 |
26-Jul-2017 |
mav |
MFC r317235: MFV 316868
7430 Backfill metadnode more intelligently
illumos/illumos-gate@af346df58864e8fe897b1ff1a3a4c12f9294391b https://github.com/illumos/illumos-gate/commit/af346df58864e8fe897b1ff1a3a4c12f9 294391b
https://www.illumos.org/issues/7430 Description and patch from brought over from the following ZoL commit: https:/ / github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6 Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don?t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can?t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG.
Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Ned Bass <bass6@llnl.gov> |
321530 |
26-Jul-2017 |
mav |
MFC r316037: MFV: 315989
7603 xuio_stat_wbuf_* should be declared (void)
illumos/illumos-gate@99aa8b55058e512798eafbd71f72f916bdc10181 https://github.com/illumos/illumos-gate/commit/99aa8b55058e512798eafbd71f72f916bdc10181
https://www.illumos.org/issues/7603
The funcs are declared k&r style, where the args are not specified:
void xuio_stat_wbuf_copied(); They should be declared to take no arguments:
void xuio_stat_wbuf_copied(void); Need to change both .c and .h.
Author: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net> |
321529 |
26-Jul-2017 |
mav |
MFC r315896: MFV r315290, r315291: 7303 dynamic metaslab selection
illumos/illumos-gate@8363e80ae72609660f6090766ca8c2c18aa53f0c https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18
https://www.illumos.org/issues/7303
This change introduces a new weighting algorithm to improve metaslab selection . The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a res ult, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based).
This also introduce a new allocation tracing facility and two new dcmds to hel p debug allocation problems. Each zio now contains a zio_alloc_list_t structure that is populated as the zio goes through the allocations stage. Here's an example of how to use the tracing facility:
> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 ENOSPC ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 ENOSPC ztest.0a 1 0 400 1 x 8M 17b1a00 ztest.0a
> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 200 0 NOT_ALLOCATABLE mirror-2 - 0 200 0 NOT_ALLOCATABLE mirror-0 1 0 200 1 x 4M 112ae00 mirror-1 - 1 200 0 NOT_ALLOCATABLE mirror-2 - 1 200 0 NOT_ALLOCATABLE mirror-0 1 1 200 1 x 4M 112b000 mirror-1 - 2 200 0 NOT_ALLOCATABLE mirror-2
If the metaslab is using segment-based weighting then the WEIGHT column will display the number of segments available in the bucket where the allocation attempt was made.
Author: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Richard Lowe <richlowe@richlowe.net> |
321527 |
26-Jul-2017 |
mav |
MFC r314267: MFV 314243
6676 Race between unique_insert() and unique_remove() causes ZFS fsid change
illumos/illumos-gate@40510e8eba18690b9a9843b26393725eeb0f1dac https://github.com/illumos/illumos-gate/commit/40510e8eba18690b9a9843b26393725ee b0f1dac
https://www.illumos.org/issues/6676
The fsid of zfs filesystems might change after reboot or remount. The problem seems to be caused by a race between unique_insert() and unique_remove(). The unique_re move() is called from dsl_dataset_evict() which is now an asynchronous thread. In a c ase the dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end up with changed fsid on zfs mount.
This problem is very likely caused by #5056.
Steps to Reproduce Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore machines it is not so easy to reproduce.
# uname -a SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris # zfs create rpool/TEST # FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}') # echo $FS::print vfs_t vfs_fsid | mdb -k vfs_fsid = { vfs_fsid.val = [ 0x54d7028a, 0x70311508 ] } # zfs umount rpool/TEST # zfs mount rpool/TEST # FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}') # echo $FS::print vfs_t vfs_fsid | mdb -k vfs_fsid = { vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ] } #
Impact The persistent fsid (filesystem id) is essential for proper NFS functionality. If the fsid of a filesystem changes on remount (or after reboot) the NFS clients might not be able to automatically recover from such event and the manual remount of the NFS filesystems on every NFS client might be needed.
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> |
321524 |
26-Jul-2017 |
mav |
MFC r313813: MFV 313786
7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid
illumos/illumos-gate@653af1b809998570c7e89fe7a0d3f90992bf0216 https://github.com/illumos/illumos-gate/commit/653af1b809998570c7e89fe7a0d3f90992bf0216
https://www.illumos.org/issues/7500 With the integration of:
commit 0f6d88aded0d165f5954688a9b13bac76c38da84 Author: Alex Reece <alex@delphix.com> Date: Sat Jul 26 13:40:04 2014 -0800 4873 zvol unmap calls can take a very long time for larger datasets
the dnode's dn_bufs field was changed from a list to a tree. As a result, the dn_unlisted_l0_blkid field is no longer necessary.
Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> |
321523 |
26-Jul-2017 |
mav |
MFC r312535: MFV 312436
6569 large file delete can starve out write ops
illumos/illumos-gate@ff5177ee8bf9a355131ce2cc61ae2da6a5a6fdd6 https://github.com/illumos/illumos-gate/commit/ff5177ee8bf9a355131ce2cc61ae2da 6a5a6fdd6
https://www.illumos.org/issues/6569 The core issue I've found is that there is no throttle for how many deletes get assigned to one TXG. As a results when deleting large files we end up filling consecutive TXGs with deletes/frees, then write throttling other (more important) ops.
There is an easy test case for this problem. Try deleting several large files (at least 1/2 TB) while you do write ops on the same pool. What we've seen is performance of these write ops (let's call it sideload I/O) would drop to zero.
More specifically the problem is that dmu_free_long_range_impl() can/will fill up all of the dirty data in the pool "instantly", before many of the sideload ops can get in. So sideload performance will be impacted until all the files are freed.
The solution we have tested at Nexenta (with positive results) creates a relatively simple throttle for how many "free" ops we let into one TXG.
However this solution exposes other problems that should also be addressed. If we are to slow down freeing of data that means one has to wait even longer (assuming vnode ref count of 1) to get shell back after an rm or for NFS thread to finish the free-ing op. To avoid this the proposed solution is to call zfs_inactive() async for "large" files. Async freeing then begs for the reclaimed space to be accounted for in the zpool's "freeing" prop.
The other issue with having a longer delete is the inability to export/unmount for a longer period of time. The proposed solution is to interrupt freeing of blocks when a fs is unmounted.
Author: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> |
316849 |
14-Apr-2017 |
avg |
MFC r315852: zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate |
315842 |
23-Mar-2017 |
avg |
MFC r314048,r314194: reimplement zfsctl (.zfs) support |
315441 |
17-Mar-2017 |
mav |
MFC r308782: After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z*_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case.
While there, untangle and unify code of z*_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that.
Sponsored by: iXsystems, Inc. |
314665 |
04-Mar-2017 |
avg |
MFC r314273: zfs: call spa_deadman on a taskqueue thread |
310511 |
24-Dec-2016 |
avg |
MFC r309098: MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try to free already free blocks |
310509 |
24-Dec-2016 |
avg |
MFC r309097: MFV r308987: 7180 potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename |
308914 |
21-Nov-2016 |
avg |
MFC r308089: zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot |
308595 |
13-Nov-2016 |
mav |
MFC r308173: Fix ZIL records ordering when ZVOL opened both with and without FSYNC.
Before this an earlier writes to a ZVOL opened without FSYNC could get to ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt), marking all log records sync if anybody opened the ZVOL with FSYNC. |
308082 |
29-Oct-2016 |
mav |
MFC r306424: MFV r306422: 7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs
dsl_dataset_space is looking at the ds_bp's fill count while dmu_objset_write_ready() is concurrently modifying it. This fix adds an rrwlock to protect the ds_bp.
Closes #180
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Author: Paul Dagnelie <pcd@delphix.com> |
307995 |
27-Oct-2016 |
avg |
MFC r306801: implement zfs_vptocnp() using z_parent property |
307299 |
14-Oct-2016 |
mav |
MFC r305563: MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of this patch:
commit ca0cc3918a1789fa839194af2a9245f801a06b1a Author: Matthew Ahrens <mahrens@delphix.com> Date: Fri Jul 24 09:53:55 2015 -0700
5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net>
This patch simply removes this macro from dsl_dataset.h.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307290 |
14-Oct-2016 |
mav |
MFC r305340: MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created:
dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write()
This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds.
Sponsored by: Intel Corp.
Closes #109
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@d3e523d489a169ab36f9ec1b2a111a60a5563a9f |
307286 |
14-Oct-2016 |
mav |
MFC r305338: MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Closes #108
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@0780b3eab5a2c13e04328b39ecd2a6d0d3c4f7cb |
307284 |
14-Oct-2016 |
mav |
MFC r305334: MFV r304157: 7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records
illumos/illumos-gate@12b90ee2d3b10689fc45f4930d2392f5fe1d9cfa https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f e1d9cfa
https://www.illumos.org/issues/7230 A test failure occurred where a send stream had only a BEGIN record. This should not be possible if the send returns without error. Prevented this from happening in the future by adding an assertion to dmu_send_impl() to verify that if the function returns 0 (success) both a BEGIN and END record are present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o r END records sent), having dump_record() set flags appropriately, adding VERIFY statement to dmu_send_impl().
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matt Krantz <matt.krantz@delphix.com> |
307282 |
14-Oct-2016 |
mav |
MFC r305333: MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr
illumos/illumos-gate@bd56f80007857b960e0981ed0797ad8ec844a96b https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec 844a96b
https://www.illumos.org/issues/7235 The function dsl_dataset_set_blkptr() is unused. We should remove it.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307277 |
14-Oct-2016 |
mav |
MFC r305331: MFV r304155: 7090 zfs should improve allocation order and throttle allocations
illumos/illumos-gate@0f7643c7376dd69a08acbfc9d1d7d548b10c846a https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b 10c846a
https://www.illumos.org/issues/7090 When write I/Os are issued, they are issued in block order but the ZIO pipelin e will drive them asynchronously through the allocation stage which can result i n blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able t o detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocatio n is successfully completed it's scheduled on a given top-level vdev. Each top- level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied.
Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: George Wilson <george.wilson@delphix.com> |
307265 |
14-Oct-2016 |
mav |
MFC r305323: MFV r302991: 6950 ARC should cache compressed data
illumos/illumos-gate@dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2 https://github.com/illumos/illumos-gate/commit/dcbf3bd6a1f1360fc1afcee9e22c6dcff7844bf2
https://www.illumos.org/issues/6950 When reading compressed data from disk, the ARC should keep the compressed block cached and only decompress it when consumers access the block. The uncompressed data should be short-lived allowing the ARC to cache a much larger amount of data. The DMU would also maintain a smaller cache of uncompressed blocks to minimize the impact of decompressing frequently accessed blocks.
Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com> |
307112 |
12-Oct-2016 |
mav |
MFC r305222: MFV r302993: 7104 increase indirect block size
illumos/illumos-gate@4b5c8e93cab28d3c65ba9d407fd8f46e3be1db1c https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3 be1db1c
https://www.illumos.org/issues/7104 The current default indirect block size is 16KB. We can improve performance by increasing it to 128KB. This is especially helpful for any workload that needs to read most of the metadata, e.g. scrub/resilver, file deletion, filesystem deletion, and zfs send. We also need to fix a few space estimation errors to make the tests pass.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307108 |
12-Oct-2016 |
mav |
MFC r305209: MFV r302660: 6314 buffer overflow in dsl_dataset_name
illumos/illumos-gate@9adfa60d484ce2435f5af77cc99dcd4e692b6660 https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6 92b6660
https://www.illumos.org/issues/6314 Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but dsl_dataset_name copies the datasets' name PLUS the snapshot name to it, resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307101 |
12-Oct-2016 |
mav |
MFC r305197: MFV r302646: 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
illumos/illumos-gate@ea4a67f462de0a39a9adea8197bcdef849de5371 https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84 9de5371
https://www.illumos.org/issues/6980 doing zfs send -i snap1 snap2 >testfile results in internal error: Invalid argument Abort (core dumped)
Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307049 |
11-Oct-2016 |
mav |
MFC r305200: MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space
illumos/illumos-gate@0c779ad424a92a84d1e07d47cab7f8009189202b https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009 189202b
https://www.illumos.org/issues/7054 upstream: ee0003de7d3e598499be7ac3fe6b61efcc47cb7f DLPX-40399 dmu_tx_hold_t should use refcount_t to track space
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> |
307046 |
11-Oct-2016 |
mav |
MFC r305195: MFV r302643: 6902 speed up listing of snapshots if requesting name only and sorting by name
This was our change from the beginning, so just reduce the upstream diff. |
304138 |
15-Aug-2016 |
avg |
MFC r302838: 6513 partially filled holes lose birth time |
304137 |
15-Aug-2016 |
avg |
MFC r302837: 6844 dnode_next_offset can detect fictional holes |
303970 |
11-Aug-2016 |
avg |
MFC r303763,303791,303869: zfs: honour and make use of vfs vnode locking protocol
ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Most importantly many things that FreeBSD VFS manages on behalf of all filesystems are implemented in ZPL in a different way. Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS.
The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by VFS.
This commit removes all ZPL internal locking that protects parent-child relationships of filesystem nodes. These relationships are protected by vnode locks and the code is changed to take advantage of that fact and to properly interact with VFS.
Removal of the internal locking allowed all ZPL dmu_tx_assign calls to use TXG_WAIT mode.
Another victim, disputable perhaps, is ZFS support for filesystems with mixed case sensitivity. That support is not provided by the OS anyway, so in ZFS it was a buch of dead code.
To do: - replace ZFS_ENTER mechanism with VFS managed / visible mechanism - replace zfs_zget with zfs_vget[f] as much as possible - get rid of not really useful now zfs_freebsd_* adapters - more cleanups of unneeded / unused code - fix / replace .zfs support
PR: 209158 Approved by: re (gjb) |
302408 |
08-Jul-2016 |
gjb |
Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, as nothing has been merged here.
Additional commits post-branch will follow.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
301010 |
31-May-2016 |
allanjude |
Connect the SHA-512t256 and Skein hashing algorithms to ZFS
Support for the new hashing algorithms in ZFS was introduced in r289422 However it was disconnected because FreeBSD lacked implementations of SHA-512 (truncated to 256 bits), and Skein.
These implementations were introduced in r300921 and r300966 respectively
This commit connects them to ZFS and enabled these new checksum algorithms
This new algorithms are not supported by the boot blocks, so do not use them on your root dataset if you boot from ZFS.
Relnotes: yes Sponsored by: ScaleEngine Inc.
|
300132 |
18-May-2016 |
avg |
zfsctl: tighten an assertion and remove an unused definition
There are only two entries under .zfs and 'shares' has an ID of a special persistent object in its filesystem.
MFC after: 1 week
|
299441 |
11-May-2016 |
mav |
MFV r299440: 6736 ZFS per-vdev ZAPs
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Joe Stein <joe.stein@delphix.com>
openzfs/openzfs@215198a6ad15cf4832370e2f19247abeb36b951a
|
297832 |
11-Apr-2016 |
mav |
MFV r297831: 6322 ZFS indirect block predictive prefetch
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Author: Alexander Motin <mav@FreeBSD.org>
Improve speculative prefetch of indirect blocks.
Scalability of many operations on wide ZFS pool can be limited by requirement to prefetch indirect blocks first. Recently added asynchronous indirect block read partially helped, but did not solve the problem completely. This patch extends existing prefetcher functionality to explicitly work with indirect blocks.
Before this change prefetcher issued reads for up to 8MB of data in advance. With this change it also issues indirect block reads for up to 64MB of data in advance, so that when it will be time to actually read those data, it can be done immediately. Alike effect can be achieved by just increasing maximal data prefetch distance, but at higher memory cost.
Also this change introduces indirect block prefetch for rewrite operations, that was never done before. Previously ARC miss for Indirect blocks regularly blocked rewrites, converting perfectly aligned asynchronous operations into synchronous read-write pairs, significantly reducing maximal rewrite speed.
While being there this issue was also fixed: - prefetch was done always, even if caching for the dataset was completely disabled.
Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool of 12 assorted HDDs shown me such performance numbers: ------- BEFORE -------- Write 491363677 bytes/sec Read 312430631 bytes/sec Rewrite 97680464 bytes/sec -------- AFTER -------- Write 493524146 bytes/sec Read 438598079 bytes/sec Rewrite 277506044 bytes/sec
Closes #65 Closes #80
openzfs/openzfs@792fd28ac04f78cc5e43ead2d72a96f244ea84e8
|
296519 |
08-Mar-2016 |
mav |
MFV r296518: 5027 zfs large block support (add copyright)
Author: Matthew Ahrens <matt@mahrens.org>
illumos/illumos-gate@c3d26abc9ee97b4f60233556aadeb57e0bd30bb9
|
296516 |
08-Mar-2016 |
mav |
MFV r296515: 6536 zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andrew Stormont <astormont@racktopsystems.com>
illumos/illumos-gate@880094b6062aebeec8eda6a8651757611c83b13e
|
296510 |
08-Mar-2016 |
mav |
MFV r296505: 6531 Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@97e81309571898df9fdd94aab1216dfcf23e060b
|
294815 |
26-Jan-2016 |
mav |
MFV r294814: 6393 zfs receive a full send as a clone
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@68ecb2ec930c4b0f00acaf8e0abb2b19c4b8b76f
This allows to do a full (non-incremental send) and receive it as a clone of an existing dataset. It can leverage nopwrite to share blocks with the origin. This can be used to change the relationship of datasets on the target. For example, maybe on the source you have:
A ---- B ---- C
And you have sent to the target a full of B, and the incremental B->C:
B ---- C
You later realize that you want to have A on the target. You will have to do a full send of A, but nopwrite can save you space on the target if you receive it as a clone of B, assuming that A and B have some blocks inxi common:
B ---- C \ A
|
294811 |
26-Jan-2016 |
mav |
MFV r294810: 6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Will Andrews <will@firepipe.net>
illumos/illumos-gate@eb5bb58421f46cee79155a55688e6c675e7dd361
|
294329 |
19-Jan-2016 |
asomers |
Disallow zvol-backed ZFS pools
Using zvols as backing devices for ZFS pools is fraught with panics and deadlocks. For example, attempting to online a missing device in the presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better to completely disable vdev_geom from ever opening a zvol. The solution relies on setting a thread-local variable during vdev_geom_open, and returning EOPNOTSUPP during zvol_open if that thread-local variable is set.
Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent was to prevent a recursive mutex acquisition panic. However, the new check for the thread-local variable also fixes that problem.
Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this function was set to panic. But it can occur that a device disappears during tasting, and it causes no problems to ignore this departure.
Reviewed by: delphij MFC after: 1 week Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4986
|
289562 |
19-Oct-2015 |
mav |
MFV r289561: 6328 Fix cstyle errors in zfs codebase
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@9a686fbc186e8e2a64e9a5094d44c7d6fa0ea167
|
289422 |
16-Oct-2015 |
mav |
MFV r289310: 4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@45818ee124adeaaf947698996b4f4c722afc6d1f
This is only a partial merge of respective ZFS infrastructure changes. At this moment FreeBSD kernel has no those crypto algorithms, so the parts of the code to enable them are commented out. When they are implemented, it will be trivial to plug them in.
|
289362 |
15-Oct-2015 |
mav |
MFV r289312: 2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9c3fd1216fa7fb02cfbc78a2518a686d54b48ab8
For more info, see: - slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive - video https://www.youtube.com/watch?v=iY44jPMvxog - manpage changes (for zfs resume -s and zfs send -t) - upcoming talk at the OpenZFS Developer Summit
The TL;DR is: Use "zfs receive -s" to save the partially received state on failure. On failure, get the receive token with "zfs get receive_resume_token <fs>" Resume the send with "zfs send -t <token_value>"
Relnotes: yes
|
289309 |
14-Oct-2015 |
mav |
MFV r289308: 6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Xin LI <delphij@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Justin T. Gibbs <gibbs@FreeBSD.org>
illumos/illumos-gate@d2058105c61ec61df3a2dd3f839fed8c3fe7bfd6
|
288204 |
25-Sep-2015 |
delphij |
MFV r288063: make dataset property de-registration operation O(1)
A change to a property on a dataset must be propagated to its descendants in case that property is inherited. For datasets whose information is not currently loaded into memory (e.g. a snapshot that isn't currently mounted), there is nothing to do; the property change will take effect the next time that dataset is loaded. To handle updates to datasets that are in-core, ZFS registers a callback entry for each property of each loaded dataset with the dsl directory that holds that dataset. There is a dsl directory associated with each live dataset that references both the live dataset and any snapshots of the live dataset. A property change is effected by doing a traversal of the tree of dsl directories for a pool, starting at the directory sourcing the change, and invoking these callbacks.
The current implementation both registers and de-registers properties individually for each loaded dataset. While registration for a property is O(1) (insert into a list), de-registration is O(n) (search list and then remove). The 'n' for de-registration, however, is not limited to the size (number of snapshots + 1) of the dsl directory. The eviction portion of the life cycle for the in core state of datasets is asynchronous, which allows multiple copies of the dataset information to be in-core at once. Only one of these copies is active at any time with the rest going through tear down processing, but all copies contribute to the cost of performing a dsl_prop_unregister().
One way to create multiple, in-flight copies of dataset information is by performing "zfs list" operations from multiple threads concurrently. In-core dataset information is loaded on demand and then evicted when reference counts drops to zero. For datasets that are not mounted, there is no persistent reference count to keep them resident. So, a list operation will load them, compute the information required to do the list operation, and then evict them. When performing this operation from multiple threads it is possible that some of the in-core dataset information will be reused, but also possible to lose the race and load the dataset again, even while the same information is being torn down.
Compounding the performance issue further is a change made for illumos issue 5056 which made dataset eviction single threaded. In environments using automation to manage ZFS datasets, it is now possible to create enough of a backlog of dataset evictions to consume excessive amounts of kernel memory and to bog down the system.
The fix employed here is to make property de-registration O(1). With this change in place, it is hoped that a single thread is more than sufficient to handle eviction processing. If it isn't, the problem can be solved by increasing the number of threads devoted to the eviction taskq.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h: Associate dsl property callback records with both the dsl directory and the dsl dataset that is registering the callback. Both connections are protected by the dsl directory's "dd_lock".
When linking callbacks into a dsl directory, group them by the property type. This helps reduce the space penalty for the double association (the property name pointer is stored once per dsl_dir instead of in each record) and reduces the number of strcmp() calls required to do callback processing when updating a single property. Property types are stored in a linked list since currently ZFS registers a maximum of 10 property types for each dataset.
Note that the property buckets/records associated with a dsl directory are created on demand, but only freed when the dsl directory is freed. Given the static nature of property types and their small number, there is no benefit to freeing the few bytes of memory used to represent the property record earlier. When a property record becomes empty, the dsl directory is either going to become unreferenced a little later in this thread of execution, or there is a high chance that another dataset is going to be loaded that would recreate the bucket anyway.
Replace dsl_prop_unregister() with dsl_prop_unregister_all(). All callers of dsl_prop_unregister() are trying to remove all property registrations for a given dsl dataset anyway. By changing the API, we can avoid doing any lookups of callbacks by property type and just traverse the list of all callbacks for the dataset and free each one.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c: Replace use of dsl_prop_unregister() with the new dsl_prop_unregister_all() API.
illumos/illumos-gate@03bad06fbb261fd4a7151a70dfeff2f5041cce1f Author: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com>
Illumos issue: 6171 dsl_prop_unregister() slows down dataset eviction https://www.illumos.org/issues/6171
MFC after: 2 weeks
|
287706 |
12-Sep-2015 |
delphij |
MFV r287699: 6214 zpools going south
In r286570 (MFV of r277426) an unprotected write to b_flags to set the compression mode was introduced. This would open a race window where data is partially decompressed, modified, checksummed and written to the pool, resulting in pool corruption due to the partial decompression.
Prevent this by reintroducing b_compress
illumos/illumos-gate@d4cd038c92c36fd0ae35945831a8fc2975b5272c
Illumos issues:
6214 zpools going south https://www.illumos.org/issues/6214
|
287702 |
12-Sep-2015 |
delphij |
MFV r287624: 5987 zfs prefetch code needs work
Rewrite the ZFS prefetch code to detect only forward, sequential streams.
The following kstats have been added:
kstat.zfs.misc.arcstats.sync_wait_for_async
How many sync reads have waited for async read to complete. (less is better)
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch
How many demand read didn't have to wait for I/O because of predictive prefetch. (more is better)
zfetch kstats have been similified to hits, misses, and max_streams, with max_streams representing times when we were not able to create new stream because we already have the maximum number of sequences for a file.
The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes to prefetch per stream.
illumos/illumos-gate@cf6106c8a0d6598b045811f9650d66e07eb332af
Illumos ZFS issues:
5987 zfs prefetch code needs work https://www.illumos.org/issues/5987
|
287103 |
24-Aug-2015 |
avg |
MFV (partial) r286889: 5692 expose the number of hole blocks in a file
FreeBSD porting notes: - only kernel-side changes are merged - the new ioctl is not actually implemented yet - thus, the goal is to synchronize DMU code
illumos/illumos-gate@2bcf0248e992f292c7b814458bcdce2f004925d6
https://www.illumos.org/issues/5692 we would like to expose the number of hole (sparse) blocks in a file. this can be useful to for example if you want to fill in the holes with some data; knowing the number of holes in advances allows you to report progress on hole filling. We could use SEEK_HOLE to do that but it would be O(n) where n is the number of holes present in the file.
Author: Max Grossman <max.grossman@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net>
|
286763 |
14-Aug-2015 |
mav |
MFV r277431: 5497 lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@244781f10dcd82684fd8163c016540667842f203
This patch attempts to reduce lock contention on the current arc_state_t mutexes. These mutexes are used liberally to protect the number of LRU lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at which these locks are acquired has been shown to greatly affect the performance of highly concurrent, cached workloads.
|
286708 |
13-Aug-2015 |
mav |
MFV 286707: 5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@ca0cc3918a1789fa839194af2a9245f801a06b1a
A ZFS feature flags (large blocks) tracks its refcounts as the number of datasets that have ever used the feature. Several features of this type are planned to be added (new checksum functions). This code should be made common infrastructure rather than duplicating the code for each feature.
|
286705 |
12-Aug-2015 |
mav |
MFV r286704: 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>
While running 'zfs recv' we noticed that every 128th 8K block required a read. We were seeing that restore_write() was calling dmu_tx_hold_write() and the indirect block was not cached. We should prefetch upcoming indirect blocks to avoid having to go to disk and blocking the restore_write().
Allow an incremental send stream to be received as a clone, even if the stream does not mark it as a clone.
|
286689 |
12-Aug-2015 |
mav |
MFV r284763: 5981 Deadlock in dmu_objset_find_dp
illumos/illumos-gate@1d3f896f5469c69c1339890ec3d68e9feddb0343
https://www.illumos.org/issues/5981 When dmu_objset_find_dp gets called with a read lock held, it fans out the work to the task queue. Each task in turn acquires its own read lock before calling the callback. If during this process anyone tries to a acquire a write lock, it will stall all read lock requests.Thus the tasks will never finish, the read lock of the caller will never get freed and the write lock never acquired. deadlock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Arne Jansen <jansen@webgods.de>
|
286686 |
12-Aug-2015 |
mav |
MFV r284762: 5269 zpool import slow
illumos/illumos-gate@12380e1e701fda28c9e9f32d01cafb54af279eb5
https://www.illumos.org/issues/5269 When importing a pool (at boot or with zpool import) with many filesystem, the process can take minutes. It doesn't matter whether the pool has been exported cleanly or uncleanly. The problem is that each dataset has its own log chain. On import, all datasets have to be checked if there are logs to replay. The idea is to speed up this process by paralellizing it.
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Arne Jansen <jansen@webgods.de>
|
286683 |
12-Aug-2015 |
mav |
MFV r286682: 5765 add support for estimating send stream size with lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@nexenta.com> Author: Max Grossman <max.grossman@delphix.com>
illumos/illumos-gate@643da460c8ca583e39ce053081754e24087f84c8
|
286677 |
12-Aug-2015 |
mav |
MFV r286224: 5695 dmu_sync'ed holes do not retain birth time
illumos/illumos-gate@70163ac57e58ace1c5c94dfbe85dca5a974eff36
https://www.illumos.org/issues/5695 In dmu_sync_ready(), a hole block pointer will have it's logical size explicitly set as it's necessary for replay purposes. To "undo" this, dmu_sync_done() will zero out any hole that it finds. This becomes a problem when using the "hole_birth" feature, as this will also wipe out any birth time that might have happened to be set on the hole. ... As a fix, the logic to zero out a hole is only applied to old style holes with a birth time of zero. Holes created with the "hole_birth" feature enabled will have a non-zero birth time, and will be skipped (thus preserving the ltime, type, and level information as well). In addition, zdb was updated to also print the ltime, type, and level information for these new style holes. Previously, only the logical birth time would be printed.
Author: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com>
|
286603 |
10-Aug-2015 |
mav |
MFV 286602: 5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@732885fca09e11183dd0ea69aaaab5588fb7dbff
|
286587 |
10-Aug-2015 |
mav |
MFV 286586: 5746 more checksumming in zfs send
Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@98110f08fa182032082d98be2ddb9391fcd62bf1
|
286579 |
10-Aug-2015 |
mav |
MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@SpectraLogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com>
illumos/illumos-gate@fe319232d24f4ae183730a5a24a09423d8ab4429
|
286575 |
10-Aug-2015 |
mav |
MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds
Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Justin Gibbs <justing@spectralogic.com>
illumos/illumos-gate@bc9014e6a81272073b9854d9f65dd59e18d18c35
|
286574 |
10-Aug-2015 |
mav |
MFV r277427: 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <bayard.bell@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@4076b1bf41cfd9f968a33ed54a7ae76d9e996fe8
|
286570 |
10-Aug-2015 |
mav |
MFV r277426: 5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Chris Williamson <Chris.Williamson@delphix.com>
illumos/illumos-gate@89c86e32293a30cdd7af530c38b2073fee01411c
Currently, every buffer cached in the L2ARC is accompanied by a 240-byte header in memory, leading to very high memory consumption when using very large cache devices. These changes significantly reduce this overhead.
Currently:
L1-only header = 176 bytes L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr = 240 bytes
Memory-optimized:
L1-only header = 176 bytes L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes L2-only header = 96 bytes + 32 byte checksum = 128 bytes
So overall:
Trunk Optimized +-----------------+ L1-only | 176 B | 176 B | (same) +-----------------+ L1 & L2 | 240 B | 208 B | (saved 32 bytes) +-----------------+ L2-only | 240 B | 128 B | (saved 116 bytes) +-----------------+
For an average blocksize of 8KB, this means that for the L2ARC, the ratio of metadata to data has gone down from about 2.92% to 1.56%. For a 'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this means that we expect a completely full L2ARC to use (1600 GB * 0.0156) / 60GB = 41% of the available memory, down from 78%.
|
286547 |
09-Aug-2015 |
mav |
MFV 286546: 5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com>
illumos/illumos-gate@db1741f555ec79def5e9846e6bfd132248514ffe
|
286545 |
09-Aug-2015 |
mav |
MFV 286544: 5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com>
|
286541 |
09-Aug-2015 |
mav |
MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds()
Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com>
illumos/illumos-gate@e57a022b8f718889ffa92adbde47a8f08abcdb25
|
284593 |
19-Jun-2015 |
avg |
MFV r284412: 5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@46e1baa6cf6d5432f5fd231bb588df8f9570c858
https://www.illumos.org/issues/5911 Sometimes ZFS appears to hang while deleting a file. It is actually making slow progress at the file deletion, but other operations (administrative and writes via the data path) "hang" until the file removal completes, which can take a long time if the file has many blocks. The deletion (or most of it) happens in a single txg, and the sync thread spends most of its time reading indirect blocks via this stack trace: swtch+0x141() cv_wait+0x70() zio_wait+0x5b() dbuf_read+0x2c0() free_children+0x50() free_children+0x12a() free_children+0x12a() free_children+0x12a() dnode_sync_free_range_impl+0xdf() dnode_sync_free_range+0x52() range_tree_vacate+0x65() dnode_sync+0x1d8() dmu_objset_sync_dnodes+0x77() dmu_objset_sync+0x19f() dsl_dataset_sync+0x51() dsl_pool_sync+0x9a() spa_sync+0x2ff() txg_sync_thread+0x21f() thread_start+8() One way to reproduce the problem is if we are over the arc_meta_limit, e.g. because lots of indirect blocks are pinned because we have L0 dbufs under them. It could be that most of the L1 indirects are cached, in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(), it will complete very quickly. This allows dmu_free_long_range_impl() to put many (perhaps all of its) transactions in the same TXG. However, dmu_free_long_range_impl() calls dnode_evict_dbufs (and dnode_free_range()), which removes the L0 dbufs, thus reducing the hold count on the L1 indirect blocks above it, allowing them to be evicted. Because we are over the arc_meta_limit(), these L1 blocks will be evicted ASAP. Thus when we get to syncing context, the L1 indirects are no longer cached and must be read in.
Obtained from: illumos MFC after: 15 days
|
284304 |
12-Jun-2015 |
avg |
MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size
illumos/illumos-gate@81cd5c555f505484180a62ca5a2fbb00d70c57d6
Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 17 days
|
282475 |
05-May-2015 |
avg |
zfs: do not hold an extra reference on a root vnode while a filesystem is mounted
At present zfs_domount() acquires a reference on the filesystem's root vnode and that reference is kept until zfs_umount. The latter calls vflush(rootrefs = 1) to dispose of the extra reference.
There is no explanation of why that reference is kept - what problem it solves or what behavior it improves. Also, that logic is FreeBSD specific.
There is one real problem with that reference, though. zfs recv -F may receive a full, non-incremental stream to a mounted filesystem. In that case the received root object is likely to have a different z_gen attribute value. Because of that, zfs_rezget will leave the previous root znode and vnode disassociated from the actual object (z_sa_hdl == NULL). Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode pair, while the old one will be kept alive by the outstanding reference. So, the outstanding reference will not actually be for the new root vnode (or, more precisely, vnodes - because a root vnode may be recycled and a newer one can be created). As a result, when vflush(rootrefs = 1) s called there will be two problems:
- a leaked reference on the old root vnode preventing a graceful unmount - insufficient references on the actual root vnode leading to a crash upon access to the vnode after it is destroyed by vgone() + vdrop()
The second issue will actually override the first one.
Differential Revision: https://reviews.freebsd.org/D2353 Reviewed by: delphij, kib, smh MFC after: 17 days
|
277300 |
17-Jan-2015 |
smh |
Mechanically convert cddl sun #ifdef's to illumos
Since the upstream for cddl code is now illumos not sun, mechanically convert all sun #ifdef's to illumos #ifdef's which have been used in all newer code for some time.
Also do a manual pass to correct the use if #ifdef comments as per style(9) as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.
MFC after: 1 month Sponsored by: Multiplay
|
276069 |
22-Dec-2014 |
smh |
Fix panic when resizing ZFS zvol's
Resizing a ZFS ZVOL with debug enabled would result in a panic due to recursion on dp_config_rwlock.
The upstream change "3464 zfs synctask code needs restructuring" changed zvol_set_volsize to avoid the recursion on dp_config_rwlock, but this was missed when originally merged in by r248571 due to significant differences in our codebases in this area.
These changes also relied on bring in changes from upstream: 3557 dumpvp_size is not updated correctly when a dump zvol's size is changed, which where also not present.
In order to help prevent future issues in this area a direct comparison and diff minimisation from current upstream version (b515258) of zvol.c.
Differential Revision: https://reviews.freebsd.org/D1302 MFC after: 1 month X-MFC-With: r276063 & r276066 Sponsored by: Multiplay
|
275811 |
15-Dec-2014 |
delphij |
MFV r275783:
Convert ARC flags to use enum. Previously, public flags are defined in arc.h and private flags are defined in arc.c which can lead to confusion and programming errors.
Consistently use 'hdr' (when referencing arc_buf_hdr_t) instead of 'buf' or 'ab' because arc_buf_t are often named 'buf' as well.
Illumos issue: 5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme
MFC after: 2 weeks
|
275782 |
15-Dec-2014 |
delphij |
MFV r275551:
Remove "dbuf phys" db->db_data pointer aliases.
Use function accessors that cast db->db_data to the appropriate "phys" type, removing the need for clients of the dmu buf user API to keep properly typed pointer aliases to db->db_data in order to conveniently access their data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c: In zap_leaf() and zap_leaf_byteswap, now that the pointer alias field l_phys has been removed, use the db_data field in an on stack dmu_buf_t to point to the leaf's phys data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c: Remove the db_user_data_ptr_ptr field from dbuf and all logic to maintain it.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c: Modify the DMU buf user API to remove the ability to specify a db_data aliasing pointer (db_user_data_ptr_ptr).
cddl/contrib/opensolaris/cmd/zdb/zdb.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_diff.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h: Create and use the new "phys data" accessor functions dsl_dir_phys(), dsl_dataset_phys(), zap_m_phys(), zap_f_phys(), and zap_leaf_phys().
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h: Remove now unused "phys pointer" aliases to db->db_data from clients of the DMU buf user API.
Illumos issue: 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS
MFC after: 2 weeks
|
275781 |
15-Dec-2014 |
delphij |
MFV r275550:
In addition to r273158, make the code in spa_sync() that checks if the current TXG is a no-op TXG less fragile.
Illumos issue: 5347 idle pool may run itself out of space
MFC after: 2 weeks
|
275740 |
13-Dec-2014 |
delphij |
MFV r275548:
Verify that the block pointer is structurally valid, before attempting to read it in. It can only be invalid in the case of a ZFS bug, but this change will help identify such bugs in a more transparent way, by panic'ing with a relevant message, rather than indexing off the end of an array or something.
Illumos issue: 5349 verify that block pointer is plausible before reading
MFC after: 2 weeks
|
275594 |
08-Dec-2014 |
delphij |
MFV r275540:
When importing a pool, don't assume that the passed pool configuration at vdev_load is always vaild. It's possible that a stale configuration that comes with extra vdevs, where metaslab_init() would fail because of lower layer returns error.
Change the code to make metaslab_init() handle and return errors from lower layer and pass it back to upper layer and handle it there.
Illumos issue: 5213 panic in metaslab_init due to space_map_open returning ENXIO
MFC after: 2 weeks
|
274337 |
10-Nov-2014 |
delphij |
MFV r274273:
ZFS large block support.
Please note that booting from datasets that have recordsize greater than 128KB is not supported (but it's Okay to enable the feature on the pool). This *may* remain unchanged because of memory constraint.
Limited safety belt is provided for mounted root filesystem but use caution is advised.
Illumos issue: 5027 zfs large block support
MFC after: 1 month
|
274304 |
09-Nov-2014 |
delphij |
MFV r274272 and diff reduction with upstream.
Illumos issue: 5244 zio pipeline callers should explicitly invoke next stage
Tested with: ztest plus ZFS over GELI configuration MFC after: 1 month
|
272810 |
09-Oct-2014 |
delphij |
MFV r272804:
Refactor the code and stop restore_object from creating two transactions.
Illumos issue: 3693 restore_object uses at least two transactions to restore an object
MFC after: 2 weeks
|
272809 |
09-Oct-2014 |
delphij |
MFV r272803:
Illumos issue: 5175 implement dmu_read_uio_dbuf() to improve cached read performance
MFC after: 2 weeks
|
272598 |
06-Oct-2014 |
delphij |
MFV r272585:
Split the godfather zio into CPU number's to reduce lock contention.
Illumos issue: 5176 lock contention on godfather zio
MFC after: 2 weeks
|
272504 |
04-Oct-2014 |
delphij |
MFV r272494:
Make space_map_truncate() always do space_map_reallocate(). Without this, setting space_map_max_blksz would cause panic for existing pool, as dmu_objset_set_blocksize would fail if the object have multiple blocks.
Illumos issues: 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled spacemap_histogram feature
MFC after: 2 weeks
|
271754 |
18-Sep-2014 |
smh |
Remove unused ZFS ARC functions
* arc_data_buf_alloc * arc_data_buf_free
MFC after: 1 week Sponsored by: Multiplay
|
271526 |
13-Sep-2014 |
delphij |
MFV r271510:
Enforce 4K as smallest indirect block size (previously the smallest indirect block size was 1K but that was never used).
This makes some space estimates more accurate and uses less memory for some data structures.
Illumos issue: 5141 zfs minimum indirect block size is 4K
MFC after: 2 weeks
|
270383 |
22-Aug-2014 |
delphij |
Instead of using timestamp in the AVL, use the memory address when comparing.
Illumos issue: 5095 panic when adding a duplicate dbuf to dn_dbufs
MFC after: 3 days
|
270247 |
20-Aug-2014 |
delphij |
MFC r270195:
Illumos issue: 5045 use atomic_{inc,dec}_* instead of atomic_add_*
MFC after: 2 weeks
|
269431 |
02-Aug-2014 |
delphij |
MFV r269427:
In dnode_children_t, use C99's "[]" idiom for declaring the variable sized array dnc_children at the end of the structure.
This prevents the compiler from mistakenly optimizing away accesses beyond the array's defined size.
Illumos issue: 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com>
MFC after: 2 weeks
|
269407 |
01-Aug-2014 |
smh |
Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods
This prevents recursion of vdev_queue_io_done as per r265321 but using a different method as recommended on the openzfs list.
We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.
zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns ZIO_PIPELINE_STOP to ensure future changes don't reintroduce ZIO_PIPELINE_CONTINUE returns.
Cleanup flow in vdev_geom_io_start while I'm here.
Also fix some cases not using SET_ERROR(..)
MFC after: 2 weeks X-MFC-With: r265321
|
269229 |
29-Jul-2014 |
delphij |
MFV r269223:
Change dn->dn_dbufs from linked list to AVL tree.
Illumos issues: 4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
|
269118 |
26-Jul-2014 |
delphij |
MFV r269010:
Import Illumos changes to address the following Illumos issues: 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4984 device selection should use fragmentation metric
MFC after: 2 weeks
|
269086 |
25-Jul-2014 |
delphij |
As of r268075, the responsibility of rounding up buffer to optimal size have been transferred from zio_compress_data to its caller. Therefore, passing the 'minblocksize' down will be a no-op.
Eliminate the parameter to reduce diff against upstream.
MFC after: 2 weeks
|
268980 |
22-Jul-2014 |
delphij |
Correct typo introduced with r268855.
MFC after: 10 days X-MFC with: r268855
|
268865 |
19-Jul-2014 |
delphij |
Reduce lock contention on the z_teardown_lock under heavily cached read workload by splitting the single teardown rrw lock into RRM_NUM_LOCKS (17) of them.
Read acquisitions are randomly distributed among these locks based on curthread pointer. Write acquisitions are going to all the locks, which for the usage of this type of lock should be rare.
Illumos issue: 5008 lock contention (rrw_exit) while running a read only load
MFC after: 2 weeks
|
268859 |
18-Jul-2014 |
delphij |
MFV r268851:
When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number).
Illumos issue: 4753 increase number of outstanding async writes when sync task is waiting
MFC after: 2 weeks
|
268858 |
18-Jul-2014 |
delphij |
MFV r268850:
Change the interaction between the DMU and ARC so that when the DMU is shutting down an objset, we do not evict the data from the ARC. Instead we simply coordinate the destruction of the DMU's data with the ARC.
The only case where we actually need to explicitly evict from the ARC is when dbuf_rele_and_unlock() determines that the administrator has requested that it not be kept in memory, via the primarycache/secondarycache properties. In this case, we evict the data from the ARC by its blkptr_t, the same way as when a block is freed we explicitly evict it from the ARC.
Illumos issue: 4631 zvol_get_stats triggering too many reads
MFC after: 2 weeks
|
268855 |
18-Jul-2014 |
delphij |
MFV r268848:
Instead of asserting all zio's be properly aligned, only assert on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be only one uberblock.
This fixes a problem that zdb would trip assert on pools with ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense space map unless the uncondensed size consumes greater than zfs_metaslab_condense_block_threshold blocks.
Illumos issue: 4958 zdb trips assert on pools with ashift >= 0xe
MFC after: 2 weeks
|
268473 |
09-Jul-2014 |
delphij |
MFV r268455:
Use reserved space for ZFS administrative commands.
We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at least) for system use. Most ZPL operations, e.g. write(2), creat(2), will fail with ENOSPC if we fall below this.
Certain operations, e.g. file removal and most administrative actions, still permitted until half of the slop space is used. This would allow users to use these operations to free up space in the pool when pool is close to full but half of slop space is still free.
A very restricted set of operations that frees up space or change quota are always permitted, regardless of the amount of free space.
MFC after: 2 weeks
|
268464 |
09-Jul-2014 |
delphij |
MFV r268452:
Explicitly mark file removal transactions as "presumed to result in a net free of space" so they will not fail with ENOSPC.
Illumos issue: 4950 files sometimes can't be removed from a full filesystem MFC after: 2 weeks
|
268123 |
01-Jul-2014 |
delphij |
MFV r268119:
4914 zfs on-disk bookmark structure should be named *_phys_t
illumos/illumos-gate@7802d7bf98dec568dadf72286893b1fe5abd8602
MFC after: 2 weeks
|
268079 |
01-Jul-2014 |
delphij |
MFV r267566:
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
MFC after: 2 weeks
|
268075 |
01-Jul-2014 |
delphij |
MFV r267565:
4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks
MFC after: 2 weeks
|
266771 |
27-May-2014 |
delphij |
MFV r266766:
Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. When set to all, ZFS stores an extra copy of all metadata. If a single on-disk block is corrupt, at worst a single block of user data (which is recordsize bytes long) can be lost.
Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. This can improve performance of random writes, because less metadata has to be written. In practice, at worst about 100 blocks (of recordsize bytes each) of user data can be lost if a single on-disk block is corrupt.
The exact behavior of which metadata blocks are stored redundantly may change in future releases.
Illumos issue: 3835 zfs need not store 2 copies of all metadata
MFC after: 2 weeks
|
265321 |
04-May-2014 |
smh |
Use a zio flag to prevent recursion of vdev_queue_io_done which can cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE from the zio_vdev_io_start stage and hence don't suspend and complete in a different thread.
This prevents double fault panic on slow machines running ZFS on GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.
MFC after: 1 month X-MFC-With: r265152
|
265152 |
30-Apr-2014 |
smh |
Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it translated the required geom values. This reduces the amount of changes required for FREE requests to be supported by the new IO scheduler. This also eliminates the need for a specific DKIOCTRIM.
Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part of their schedule.
As the new IO scheduler can result in a request to execute one type of IO to actually run a different type of IO it requires that zio_trim requests are processed without holding the trim map lock (tm->tm_lock), as the free request execute call may result in write request running hence triggering a trim_map_write_start call, which takes the trim map lock and hence would result in recused on no-recursive sx lock.
This is based off avg's original work, so credit to him.
MFC after: 1 month
|
265046 |
28-Apr-2014 |
smh |
Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting in the wrong ZIO continuing its pipeline.
This is a serious issue which could cause data loss / corruption but appears to be limited to error handling such as when vdev_readable(vd) returns false.
MFC after: 2 days
|
264850 |
24-Apr-2014 |
smh |
Add the ability to set a minimum ashift size for ZFS pool creation or root level vdev addition.
Change max_auto_ashift sysctl to error when an invalid value is requested instead of silently limiting it.
|
264835 |
23-Apr-2014 |
delphij |
MFV r264829:
3897 zfs filesystem and snapshot limits
MFC after: 2 weeks
|
264671 |
18-Apr-2014 |
delphij |
MFV r264668:
4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed
illumos/illumos@b6240e830b871f59c22a3918aebb3b36c872edba
MFC after: 2 weeks
|
264669 |
18-Apr-2014 |
delphij |
MFV r264666:
4374 dn_free_ranges should use range_tree_t
illumos/illumos-gate@bf16b11e8deb633dd6c4296d46e92399d1582df4
MFC after: 2 weeks
|
260185 |
02-Jan-2014 |
delphij |
MFV r260155:
When we encounter an I/O error on a piece of metadata while deleting a file system or zvol, we don't update the bptree_entry_phys_t's bookmark. This would lead to double free of bp's which will lead to space map corruption.
Instead of tolerating and allowing the corruption, panic immediately.
See Illumos #4390 for more details.
4391 panic system rather than corrupting pool if we hit bug 4390
Illumos/illumos-gate@8b36997aa24d9817807faa4efa301ac9c07a2b78
MFC after: 2 weeks
|
260183 |
02-Jan-2014 |
delphij |
MFV r260154 + 260182:
4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools
Illumos/illumos-gate@78f171005391b928aaf1642b3206c534ed644332
MFC after: 2 weeks
|
260150 |
01-Jan-2014 |
delphij |
MFV r259170:
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
illumos/illumos-gate@43466aae47bfcd2ad9bf501faec8e75c08095e4f
NOTE: Make sure the boot code is updated if a zpool upgrade is done on boot zpool.
MFC after: 2 weeks
|
260141 |
31-Dec-2013 |
delphij |
MFV r258385:
(Note: this change is not applicable to FreeBSD and the file is not included in build. It's integrated for completeness).
4128 disks in zpools never go away when pulled
illumos/illumos-gate@39cddb10a31c1c2e66aed69e6871d09caa4c8147
MFC after: 2 weeks
|
260138 |
31-Dec-2013 |
delphij |
MFV r242733:
3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help message
illumos/illumos-gate@31d7e8fa33fae995f558673adb22641b5aa8b6e1
FreeBSD porting notes: the kernel part of this changeset depends on Solaris buf(9S) interfaces and are not really applicable for our use. vdev_disk.c is patched as-is to reduce diverge from upstream, but vdev_file.c is left intact.
MFC after: 2 weeks
|
259813 |
24-Dec-2013 |
delphij |
MFV r258374:
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features
illumos/illumos-gate@2acef22db7808606888f8f92715629ff3ba555b9
MFC after: 2 weeks
|
258745 |
29-Nov-2013 |
avg |
zfs: add dmu_write_pages variant for freebsd
The freebsd variant of dmu_write_pages is hidden under _KERNEL to avoid needlessly pulling in vm_page_t declaration. Besides, this function seems to be useless for ZFS userland counterpart.
MFC after: 15 days
|
258717 |
28-Nov-2013 |
avg |
MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control
4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4104 ::spa_space no longer works 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab
illumos/illumos-gate@0713e232b7712cd27d99e1e935ebb8d5de61c57d
Note that some tunables have been removed and some new tunables have been added. Of particular note, FreeBSD-only knob vfs.zfs.space_map_last_hope is removed as it was a nop for some time now (after one of the previous merges from upstream).
MFC after: 11 days Sponsored by: HybridCluster [merge]
|
258634 |
26-Nov-2013 |
avg |
MFV r258376: 3964 L2ARC should always compress metadata buffers
illumos/illumos-gate@e4be62a2b74a8f09bb669217a1a39eee069b13a1
MFC after: 10 days Sponsored by: HybridCluster [merge]
|
258633 |
26-Nov-2013 |
avg |
MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool 4081 need zfs_mg_noalloc_threshold
illumos/illumos-gate@22e30981d82a0b6dc89253596ededafae8655e00
MFC after: 10 days Sponsored by: HybridCluster [merge]
|
258632 |
26-Nov-2013 |
avg |
MFV r255255: 4045 zfs write throttle & i/o scheduler performance work
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Please note the following changes: - zio_ioctl has lost its priority parameter and now TRIM is executed with 'now' priority - some knobs are gone and some new knobs are added; not all of them are exposed as tunables / sysctls yet
MFC after: 10 days Sponsored by: HybridCluster [merge]
|
258631 |
26-Nov-2013 |
avg |
MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
illumos/illumos-gate@ec94d32216ed5705f5176582355cc311cf848e73
MFC after: 9 days Sponsored by: HybridCluster [merge]
|
258630 |
26-Nov-2013 |
avg |
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP illumos/illumos-gate@5aeb94743e3be0c51e86f73096334611ae3a058e
Essentially FreeBSD taskqueues already operate in a mode that was added to Illumos with taskq_dispatch_ent change. We even exposed the superior FreeBSD interface as taskq_dispatch_safe. Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and struct struct ostask to taskq_ent_t, so that code differences will be minimal.
After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no longer needed.
Note that this commit is not an MFV because the upstream change was not individually committed to the vendor area.
MFC after: 8 days
|
258137 |
14-Nov-2013 |
mav |
Introduce allocation cache to store LZ4 compression contexts without kicking VM subsystem twice for every written record.
Tests on 24-core system show double reduction of CPU time spent on copying single large well-compressed file.
This patch is not really needed on illumos (while not harm either) since their memory allocator by default uses caching for all requests up to 128K.
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
|
256956 |
23-Oct-2013 |
smh |
Improve ZFS N-way mirror read performance by using load and locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being under utilized.
The new algorithm takes into the following additional factors: * Load of the vdevs (number outstanding I/O requests) * The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay
|
255750 |
21-Sep-2013 |
delphij |
MFV r254750:
Add support of Illumos dumps on zvol over RAID-Z.
Note that this only adds the features. FreeBSD would still need more work to support dumping on zvols.
Illumos ZFS issues: 2932 support crash dumps to raidz, etc. pools
MFC after: 1 month Approved by: re (ZFS blanket)
|
255437 |
10-Sep-2013 |
delphij |
MFV r247844 (illumos-gate 13975:ef6409bc370f)
Illumos ZFS issues: 3582 zfs_delay() should support a variable resolution 3584 DTrace sdt probes for ZFS txg states
Provide a compatibility shim for Solaris's cv_timedwait_hires to help aid future porting.
Approved by: re (ZFS blanket)
|
254753 |
24-Aug-2013 |
delphij |
MFV r254747:
Fix a panic from dbuf_free_range() from dmu_free_object() while doing zfs receive. This is a regression from FreeBSD r253821.
Illumos ZFS issues: 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive
|
254711 |
23-Aug-2013 |
avg |
zfs: inline and remove zfs_vnode_lock
It didn't serve any useful purpose, but obscured file and line information useful for debugging.
MFC after: 5 days X-MFC with: r254445
|
254608 |
21-Aug-2013 |
gibbs |
Add kstat entries for ZFS compression statistics.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: Add module lifetime functions to allocate and teardown state data.
Report: - Compression attempts. - Buffers found to be empty. - Compression calls that are skipped because the data length is already less than or equal to the minimum block length. - Compression attempts that fail to yield a 12.5% compression ratio.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c: Add calls to the zio_compress.c module's init and fini functions.
Sponosred by: Spectra Logic Corporation MFC after: 2 weeks
|
254591 |
21-Aug-2013 |
gibbs |
Enhance the ZFS vdev layer to maintain both a logical and a physical minimum allocation size for devices. Use this information to automatically increase ZFS's minimum allocation size for new top-level vdevs to a value that more closely matches the optimum device allocation size.
Use GEOM's stripesize attribute, if set, as the physical sector size of the GEOM.
Calculate the minimum blocksize of each metaslab class. Use the calculated value instead of SPA_MINBLOCKSIZE (512b) when determining the likelyhood of compression yeilding a reduction in physical space usage.
Report devices with sub-optimal block size configuration in "zpool status". Also properly fail attempts to attach devices with a logical block size greater than 8kB, since this will cause corruption to ZFS's label area.
Sponsored by: Spectra Logic Corporaion MFC after: 2 weeks
Background ========== Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices.
Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks:
1) Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device.
2) The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit.
Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h: Add the SPA_ASHIFT constant. ZFS currently has a hard upper limit of 13 (8k) for ashift and this constant is used to both document and enforce this limit.
sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h: Add the VDEV_AUX_ASHIFT_TOO_BIG error code.
Add fields for exporting the configured, logical, and physical ashift to the vdev_stat_t structure.
Add VDEV_STAT_VALID() macro which can be used to verify the presence of required vdev_stat_t fields in nvlist data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: Provide a SYSCTL_PROC handler for "max_auto_ashift". Since the limit is only referenced long after boot when a create operation occurs, there's no compelling need for it to be a boot time configurable tunable. This also allows the validation code for the max_auto_ashift value to be contained within the sysctl handler.
Populate the new fields in the vdev_stat_t structure.
Fail vdev opens if the vdev reports an ashift larger than SPA_MAXASHIFT.
Propogate vdev_logical_ashift and vdev_physical_ashift between child and parent vdevs as is done for vdev_ashift.
In vdev_open(), restore code that fails opens for devices where vdev_ashift grows. This can only happen now if the device's logical ashift grows, which means it really isn't safe to use the device.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c: Update the vdev_open() API so that both logical (what was just ashift before) and physical ashift are reported.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: Add two new fields, vdev_physical_ashift and vdev_logical_ashift, to vdev_t.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c: Add vdev_ashift_optimize(). Call it anytime a new top-level vdev is allocated.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.
For each sub-optimally configured leaf vdev, report configured and native block sizes.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h: cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT. This status is reported on healthy pools containing vdevs configured to use a block size smaller than their reported physical block size.
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Update find_vdev_problem() and supporting functions to provide the full vdev_stat_t structure to problem checking routines, and to allow decent into replacing vdevs.
Add a vdev_non_native_ashift() validator which is used on the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.
cddl/contrib/opensolaris/lib/libzpool/common/kernel.c: cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h: Enhance sysctl userland stubs now that a SYSCTL_PROC handler is used in vdev.c.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h: When the group membership of a metaslab class changes (i.e. when a vdev is added or removed from a pool), walk the group list to determine the smallest block size currently available and record this in the metaslab class.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: Add the metaslab_class_get_minblocksize() accessor.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In zio_compress_data(), take the minimum blocksize as an input parameter instead of assuming SPA_MINBLOCKSIZE.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum blocksize of the device. The l2arc code performs has it's own code for deciding if compression is worth while, so this effectively disables zio_compress_data() from second guessing the original decision.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c: In zio_write_bp_init(), use the minimum blocksize of the normal metaslab class when compressing data.
|
254587 |
21-Aug-2013 |
delphij |
MFV r254421:
Illumos ZFS issues: 3996 want a libzfs_core API to rollback to latest snapshot
|
254112 |
08-Aug-2013 |
delphij |
MFV r254079:
Illumos ZFS issues: 3957 ztest should update the cachefile before killing itself 3958 multiple scans can lead to partial resilvering 3959 ddt entries are not always resilvered 3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth 3961 freed gang blocks are not resilvered and can cause pool to suspend 3962 ztest should print out zfs debug buffer before exiting
|
253992 |
06-Aug-2013 |
mav |
Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really disable TRIM otherwise.
r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has no lock dependencies and should complete immediately. Unfortunately, with our TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still acquires SCL_ZIO lock for read to be sure devices won't disappear.
When TRIM is disabled, this patch enables direct free execution from r252840 and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from the pipeline to avoid lock acquisition. Otherwise it queues free request as it was before r252840.
|
253990 |
06-Aug-2013 |
mav |
Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events. Existing async thread is running only on successfull spa_sync() completion, that is impossible in case of pool loosing required (last) disk(s). That indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the lost disks, preventing GEOM/CAM from destroying devices and reusing names on later disk reattach.
In earlier version of the patch I've tried to just run existing thread immediately, unrelated to spa_sync() completion, but that exposed number of situations where it could stuck due to locks held by stuck spa_sync(), that are required for other kinds of async events.
Experiments with OpenIndiana snapshot confirmed that they also have this issue with lost disks reattach.
|
253821 |
30-Jul-2013 |
delphij |
MFV r253783:
Skip eviction step of processing free records when doing ZFS receive to avoid the expensive search operation of non-existent dbufs in dn_dbufs.
Illumos ZFS issues: 3834 incremental replication of 'holey' file systems is slow
MFC after: 2 weeks
|
253820 |
30-Jul-2013 |
delphij |
MFV r253782:
To quote Illumos issue #3888:
When 'zfs recv -F' is used with an incremental recv it rolls back any changes made since the last snapshot in case new changes were made to the file system while the recv is in progress (without -F the recv would fail when it does it's final check to commit the recv-ed data as the recv-ed data conflicts with the newly written data).
However, if there is a snapshot taken after the recv began rolling back to the 'latest' snapshot will not help and the recv will still fail. 'zfs recv -F' should be extended to destroy any snapshots created since the source snapshot when finishing the recv (effectively rolling back through all snapshots, instead of just to the latest snapshot).
Illumos ZFS issues: 3888 zfs recv -F should destroy any snapshots created since the incremental source
MFC after: 2 weeks
|
253819 |
30-Jul-2013 |
delphij |
MFV r253781 + r253871:
Illumos ZFS issues: 3894 zfs should not allow snapshot of inconsistent dataset
MFC after: 2 weeks
|
253816 |
30-Jul-2013 |
delphij |
MFV r253780:
To quote Illumos #3875:
The problem here is that if we ever end up in the error path, we drop the locks protecting access to the zfsvfs_t prior to forcibly unmounting the filesystem. Because z_os is NULL, any thread that had already picked up the zfsvfs_t and was sitting in ZFS_ENTER() when we dropped our locks in zfs_resume_fs() will now acquire the lock, attempt to use z_os, and panic.
Illumos ZFS issues: 3875 panic in zfs_root() after failed rollback
MFC after: 2 weeks
|
253643 |
25-Jul-2013 |
mav |
Following r222950, revert unintentional change cls -> class in argument name in r245264. Aside from non-uniformity, that again confused C++ compilers.
|
252840 |
05-Jul-2013 |
mm |
MFV r252839:
Quoting illumos issue #3836: Currently zio_free() always puts the zio on a list for subsequent processing by zio_free_sync(). This is only necessary for frees that might need to issue reads (gang and dedup blocks).
By processing the majority of the frees as we encounter them, we reduce the amount of time that the spa_sync() thread spends burning CPU and not doing any i/o, thus increasing the overall write throughput of the system.
Illumos ZFS issues: 3836 zio_free() can be processed immediately in the common case
MFC after: 1 week
|
252431 |
30-Jun-2013 |
rmh |
Enable kernel-specific code for FreeBSD also on other systems that use the kernel of FreeBSD.
Reviewed by: pjd
|
251646 |
12-Jun-2013 |
delphij |
MFV r251644:
Poor ZFS send / receive performance due to snapshot hold / release processing (by smh@)
Illumos ZFS issues: 3740 Poor ZFS send / receive performance due to snapshot hold / release processing
MFC after: 2 weeks
|
251636 |
11-Jun-2013 |
delphij |
MFV r251626:
ZFS event processing should work on R/O root filesystems
Illumos ZFS issues: 3749 zfs event processing should work on R/O root filesystems
MFC after: 2 weeks
|
251633 |
11-Jun-2013 |
delphij |
MFV r251622:
ZFS shouldn't ignore errors unmounting snapshots
Illumos ZFS issues: 3744 zfs shouldn't ignore errors unmounting snapshots
MFC after: 2 weeks
|
251631 |
11-Jun-2013 |
delphij |
MFV r251620:
ZFS comments need cleaner, more consistent style
Illumos ZFS issues: 3741 zfs comments need cleaner, more consistent style
MFC after: 2 weeks
|
251629 |
11-Jun-2013 |
delphij |
MFV r251619:
ZFS needs better comments.
Illumos ZFS issues: 3741 zfs needs better comments
MFC after: 2 weeks
|
251520 |
08-Jun-2013 |
delphij |
MFV r251519:
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks
Quote from the Illumos issue:
ZFS should proactively evict freed blocks from the cache.
Even though these freed blocks will never be used again, and thus will eventually be evicted, this causes us to use memory inefficiently for 2 reasons:
1. A block that is freed has no chance of being accessed again, but will be kept in memory preferentially to a block that was accessed before it (and is thus older) but has not been freed and thus has at least some chance of being accessed again.
2. We partition the ARC into several buckets: user data that has been accessed only once (MRU) metadata that has been accessed only once (MRU) user data that has been accessed more than once (MFU) metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary control on how much memory is used to cache data vs metadata is to simply try to keep the proportion the same as it has been in the past (each bucket "evicts against" itself). The secondary control is to evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly containing freed blocks that are very old, while another bucket has more recently accessed, still-allocated blocks. Data in the useful bucket (with still-allocated blocks) may be evicted in preference to data in the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU data bucket was 27GB and the MRU metadata bucket was 256GB. However, the vast majority of data in the MRU metadata bucket (256GB) was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB) was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more investigation. However, if we stop caching freed blocks, it should reduce the impact of this more fundamental issue.
MFC after: 2 weeks
|
251478 |
06-Jun-2013 |
delphij |
MFV r251474:
* Illumos zfs issue #3137 L2ARC compression
Whether or not to compress buffers entering the L2ARC is controlled by "compression" setting on the dataset, when compression is not "off", L2ARC compression is enabled.
The compress method is always LZ4 for L2ARC when enabled because it works best for the scenario.
MFC after: 2 weeks
|
249921 |
26-Apr-2013 |
smh |
Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled Enabled ZFS TRIM by default
Reviewed by: pjd (mentor) Approved by: pjd (mentor) MFC after: 2 weeks
|
249858 |
24-Apr-2013 |
mm |
MFV r249857:
Merge vendor bugfix for a possible deadlock related to async destroy and improve write performance by introducing a new lock protecting tx_open_txg.
Illumos ZFS issues: 3642 dsl_scan_active() should not issue I/O to determine if async destroying is active 3643 txg_delay should not hold the tc_lock
MFC after: 1 week
|
249206 |
06-Apr-2013 |
mm |
MFV r248660: Merge vendor change - modify time processing in deadman thread.
Illumos ZFS issues: 3618 ::zio dcmd does not show timestamp data
MFC after: 3 weeks
|
248574 |
21-Mar-2013 |
smh |
Improve TXG handling in the TRIM module. This patch adds some improvements to the way the trim module considers TXGs:
- Free ZIOs are registered with the TXG from the ZIO itself, not the current SPA syncing TXG (which may be out of date); - L2ARC are registered with a zero TXG number, as L2ARC has no concept of TXGs; - The TXG limit for issuing TRIMs is now computed from the last synced TXG, not the currently syncing TXG. Indeed, under extremely unlikely race conditions, there is a risk we could trim blocks which have been freed in a TXG that has not finished syncing, resulting in potential data corruption in case of a crash.
Reviewed by: pjd (mentor) Approved by: pjd (mentor) Obtained from: https://github.com/dechamps/zfs/commit/5b46ad40d9081d75505d6f3bf04ac652445df366 MFC after: 2 weeks
|
248572 |
21-Mar-2013 |
smh |
Add TRIM support for L2ARC.
This adds TRIM support to cache vdevs. When ARC buffers are removed from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(), the size previously occupied by the buffer gets scheduled for TRIMming. As always, actual TRIMs are only issued to the L2ARC after txg_trim_limit.
Reviewed by: pjd (mentor) Approved by: pjd (mentor) Obtained from: https://github.com/dechamps/zfs/commit/31aae373994fd112256607edba7de2359da3e9dc MFC after: 2 weeks
|
248571 |
21-Mar-2013 |
mm |
Merge libzfs_core branch: includes MFV 238590, 238592, 247580
MFV 238590, 238592: In the first zfs ioctl restructuring phase, the libzfs_core library was introduced. It is a new thin library that wraps around kernel ioctl's. The idea is to provide a forward-compatible way of dealing with new features. Arguments are passed in nvlists and not random zfs_cmd fields, new-style ioctls are logged to pool history using a new method of history logging.
http://blog.delphix.com/matt/2012/01/17/the-future-of-libzfs/
MFV 247580 [1]: To address issues of several deadlocks and race conditions the locking code around dsl_dataset was rewritten and the interface to synctasks was changed.
User-Visible Changes: "zfs snapshot" can create more arbitrary snapshots at once (atomically) "zfs destroy" destroys multiple snapshots at once "zfs recv" has improved performance
Backward Compatibility: I have extended the compatibility layer to support full backward compatibility by remapping or rewriting the responsible ioctl arguments. Old utilities are fully supported by the new kernel module.
Forward Compatibility: New utilities work with old kernels with the following restrictions: - creating, destroying, holding and releasing of multiple snapshots at once is not supported, this includes recursive (-r) commands
Illumos ZFS issues: 2882 implement libzfs_core 2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once 3464 zfs synctask code needs restructuring
References: https://www.illumos.org/issues/2882 https://www.illumos.org/issues/2900 https://www.illumos.org/issues/3464 [1]
MFC after: 1 month Sponsored by: Hybrid Logic Inc. [1]
|
248084 |
09-Mar-2013 |
attilio |
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes.
The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
|
247398 |
27-Feb-2013 |
mm |
MFV 247176, 247178, 247315: Import metaslab_sync() speedup from vendor (illumos).
Illumos ZFS issues: 3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread 3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing) 3578 transferring the freed map to the defer map should be constant time 3579 ztest trips assertion in metaslab_weight()
References: https://www.illumos.org/issues/3552 https://www.illumos.org/issues/3564 https://www.illumos.org/issues/3578 https://www.illumos.org/issues/3579
MFC after: 2 weeks
|
247265 |
25-Feb-2013 |
mm |
MFV v242732:
Merge the ZFS I/O deadman thread from vendor (illumos). This feature panics the system on hanging ZFS I/O, helps debugging and resumes failed service.
The panic behavior can be controlled with the loader-only tunables: vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O) vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)
By default, ZFS I/O deadman is enabled by default on amd64 and i386 excluding virtual guest machines.
Illumos ZFS issues: 3246 ZFS I/O deadman thread
References: https://www.illumos.org/issues/3246
MFC after: 2 weeks
|
246666 |
11-Feb-2013 |
mm |
MFV r246392: Import vendor ZFS bugfix fixing a possible deadlock in arc_read().
Illumos ZFS issues: 3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
References: https://www.illumos.org/issues/3498
MFC after: 2 weeks
|
246651 |
11-Feb-2013 |
mm |
MFV r246390: Import minor type change in refcount.h header from vendor (illumos).
MFC after: 2 weeks
|
246586 |
09-Feb-2013 |
delphij |
MFV r245512:
* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.
LZ4 is a new high-speed BSD-licensed compression algorithm created by Yann Collet that delivers very high compression and decompression performance compared to lzjb (>50% faster on compression, >80% faster on decompression and around 3x faster on compression of incompressible data), while giving better compression ratio [1].
This version of LZ4 corresponds to upstream's [2] revision 85.
Please note that for obvious reasons this is not backward read compatible. This means once a pool have LZ4 compressed data, these data can no longer be read by older ZFS implementations.
Local changes:
- On-stack hash table disabled and using kernel slab allocator instead, at this time. This requires larger kernel thread stack for zio workers. This may change in the future should we adjusted the zio workers' thread stack size. - likely and unlikely will be undefined if they are already defined, this is required for i386 XEN build. - Removed De Bruijn sequence based __builtin_ctz family of builtins in favor of the latter. Both GCC and clang supports these builtins. - Changed the way the LZ4 code detects endianness. - Manual pages modifications to mention the feature based on Illumos counterpart. - Boot loader changes to make it support LZ4 decompression.
[1] https://www.illumos.org/issues/3035 [2] http://code.google.com/p/lz4/source/list
Obtained from: Illumos (13921:9d721847e469) Tested on: FreeBSD/amd64 MFC after: 1 month
|
246531 |
08-Feb-2013 |
avg |
zfs: update comments about zfid_long_t to match the FreeBSD definitions
MFC after: 1 week
|
245264 |
10-Jan-2013 |
delphij |
The current ZFS code expects ddt_zap_count to always succeed by asserting the underlying zap_count() to return no errors. However, it is possible that the pool reaches to such a state where zap_count would return error, leading to panics when a pool is imported.
This commit changes the ddt_zap_count to return error returned from zap_count and handle the error appropriately. With this change, it's now possible to let zpool rollback damaged transaction groups and import the pool.
Obtained from: ZFS on Linux github (e8fd45a0f975c6b8ae8cd644714fc21f14fac2bf) MFC after: 1 month
|
244155 |
12-Dec-2012 |
smh |
Renamed zfs trim stats removing duplicate zio_trim identifier from the name Added description option to kstats. Added descriptions for zio_trim kstats
PR: kern/173113 Submitted by: Steven Hartland Reviewed by: pjd Approved by: pjd MFC after: 2 weeks
|
243524 |
25-Nov-2012 |
mm |
MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O, nop-write omits overwriting data if the checksum (cryptographically secure) of new data matches the checksum of existing data. It also saves space if snapshots are in use.
It currently works only on datasets with enabled compression, disabled deduplication and sha256 checksums.
IllumOS 13887:196932ec9e6a and 13888:7204b3392a58 3236 zio nop-write
References: https://www.illumos.org/issues/3236
MFC after: 2 weeks
|
243520 |
25-Nov-2012 |
avg |
zfs: overhaul zfs-vfs glue for vnode life-cycle management
* There is no need for the delayed destruction of znodes via taskqueue, now that we do not need to fear recursion from getnewvnode into zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state machine a bit simpler.
* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD vop_inactive and vop_reclaim model. All destructive actions are done in zfs_freebsd_reclaim. This allows to simplify zfs_zget logic.
* Allow zfs_zget to return a doomed vnode if the current thread already has an exclusive lock on the vnode.
* Clean up Solaris-isms like bailing out of reclaim/inactive on certain values of v_usecount (aka v_count) or directly messing with this counter.
* Do not clear z_vnode while znode is still accessible. z_vnode should be cleared only after zfs_znode_dmu_fini. Otherwise zfs_zget may get an effectively half-deconstructed znode. This allows to simplify zfs_zget logic further.
The above changes fix at least two known/reported problems:
o An indefinite wait in the following code path: vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject -> put_pages -> zfs_write -> zil_commit -> zfs_zget This happened because vgone marks a vnode as VI_DOOMED before calling VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any circumstances. The fix in this change is not complete as it won't fix a deadlock between two threads doing VOP_RECLAIM where one thread is in zil_commit trying to zfs_zget a znode/vnode being reclaimed by the other thread, which would be blocked trying to enter zil_commit. This type of deadlock has not been reported as of now.
o An indefinite wait in the unmount path caused by a znode "falling through the cracks" in inactive+reclaim. This would happen if the znode is unlinked while its vnode is still active.
To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs glue code doesn't have to re-lock a vnode but could ask for proper locking from the very start. This would also allow for the higher level code to obtain a doomed vnode when it is expected/requested. Or to avoid blocking when it is not allowed (see zil_commit example above).
ffs_vgetf seems like a good source of inspiration.
Tested by: Willem Jan Withagen <wjw@digiware.nl> MFC after: 6 weeks
|
243503 |
25-Nov-2012 |
mm |
MFV r242735:
Illumos 13879:4eac7a87eff2: 3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb() 3330 space_seg_t should have its own kmem_cache 3331 deferred frees should happen after sync_pass 1 3335 make SYNC_PASS_* constants tunable
New loader-only tunables: vfs.zfs.sync_pass_deferred_free vfs.zfs.sync_pass_dont_compress vfs.zfs.sync_pass_rewrite
References: https://www.illumos.org/issues/3329 https://www.illumos.org/issues/3330 https://www.illumos.org/issues/3331 https://www.illumos.org/issues/3335
MFC after: 2 weeks
|
242845 |
10-Nov-2012 |
delphij |
MFV r242729 (mm):
Illumos r13840:97fd5cdf328a:
3145 single-copy arc 3212 ztest: race condition between vdev_online() and spa_vdev_remove()
Illumos r13849:3468a95b27cd:
3258 ztest's use of file descriptors is unstable
|
241286 |
06-Oct-2012 |
avg |
zfs_mount: taste geom providers for root pool config
This should allow to mount a dataset as a root filesystem even if it belongs to a pool that is not described in zpool.cache. This adds some overhead to the boot process though.
If the root filesystem's pool is found in zpool.cache, the by default its cached configuration will be used for import. vfs.zfs.rootpool.prefer_cached_config could be set to zero to force the config to be retasted.
Discussed with: gibbs, pjd, des MFC after: 25 days
|
240955 |
26-Sep-2012 |
mm |
Merge recent vendor changes in ZFS.
Illumos issued covered: 2811 missing implementation: zfs send -r 3139 zdb dies when it tries to determine path of unlinked file 3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg 3208 moving zpool cross-endian results in incorrect user/group accounting
References: https://www.illumos.org/issues/ + [issue_id]
Obtained from: illumos (vendor/illumos, vendor/illumos-sys) MFC after: 2 weeks
|
240868 |
23-Sep-2012 |
pjd |
Add TRIM support.
The code builds a map of regions that were freed. On every write the code consults the map and eventually removes ranges that were freed before, but are now overwritten.
Freed blocks are not TRIMed immediately. There is a tunable that defines how many txg we should wait with TRIMming freed blocks (64 by default).
There is a low priority thread that TRIMs ranges when the time comes. During TRIM we keep in-flight ranges on a list to detect colliding writes - we have to delay writes that collide with in-flight TRIMs in case something will be reordered and write will reached the disk before the TRIM. We don't have to do the same for in-flight writes, as colliding writes just remove ranges to TRIM.
Sponsored by: multiplay.co.uk
This work includes some important fixes and some improvements obtained from the zfsonlinux project, including TRIMming entire vdevs on pool create/add/attach and on pool import for spare and cache vdevs.
Obtained from: zfsonlinux Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>
|
240631 |
18-Sep-2012 |
avg |
zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line
Discussed with: pjd MFC after: 10 days
|
240133 |
05-Sep-2012 |
mm |
Merge recent vendor changes and sync code: 1862 incremental zfs receive fails for sparse file > 8PB 3112 ztest does not honor ZFS_DEBUG 3122 zfs destroy filesystem should prefetch blocks 3129 'zpool reopen' restarts resilvers 3130 ztest failure: Assertion failed: 0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)
References: https://www.illumos.org/issues/1862 https://www.illumos.org/issues/3112 https://www.illumos.org/issues/3122 https://www.illumos.org/issues/3129 https://www.illumos.org/issues/3130
Obtained from: illumos (vendor/illumos, vendor/illumos-sys) MFC after: 2 weeks
|
239774 |
28-Aug-2012 |
mm |
Merge recent vendor changes: 3100 zvol rename fails with EBUSY when dirty 3104 eliminate empty bpobjs 3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc
References: https://www.illumos.org/issues/3100 https://www.illumos.org/issues/3104 https://www.illumos.org/issues/3120
Obtained from: illumos (vendor/illumos, vendor/illumos-sys) MFC after: 2 weeks
|
239620 |
23-Aug-2012 |
mm |
Merge recent vendor changes: 3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets 3090 vdev_reopen() during reguid causes vdev to be treated as corrupt 3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Referenes: https://www.illumos.org/issues/3086 https://www.illumos.org/issues/3090 https://www.illumos.org/issues/3102
PR: kern/170912, kern/170914 Obtained from: illumos (changeset #13776, #13777) MFC after: 2 weeks
|
238926 |
30-Jul-2012 |
mm |
Partial MFV (illumos-gate 13753:2aba784c276b) 2762 zpool command should have better support for feature flags
References: https://www.illumos.org/issues/2762
MFC after: 2 weeks
|
238113 |
04-Jul-2012 |
pjd |
vdev_io_done stage is not used for ioctls.
MFC after: 1 week
|
236884 |
11-Jun-2012 |
mm |
Introduce "feature flags" for ZFS pools (bump SPA version to 5000). Add first feature "com.delphix:async_destroy" (asynchronous destroy of ZFS datasets). Implement features support in ZFS boot code.
Illumos revisions merged: 13700:2889e2596bd6 13701:1949b688d5fb 2619 asynchronous destruction of ZFS file systems 2747 SPA versioning with zfs feature flags
References: https://www.illumos.org/issues/2619 https://www.illumos.org/issues/2747
Obtained from: illumos (issue #2619, #2747) MFC after: 1 month
|
236155 |
27-May-2012 |
mm |
Import illumos changeset 13570:3411fd5f1589 1948 zpool list should show more detailed pool information
Display per-vdev information with "zpool list -v". The added expandsize property has currently no value on FreeBSD. This changeset allows adding expansion support to individual vdevs in the future.
References: https://www.illumos.org/issues/1948
Obtained from: illumos (issue #1948) MFC after: 2 weeks
|
235222 |
10-May-2012 |
mm |
Import illumos changeset 13686:4bc0783f6064 2703 add mechanism to report ZFS send progress
If the zfs send command is used with the -v flag, the amount of bytes transmitted is reported in per second updates.
References: https://www.illumos.org/issues/2703
Obtained from: illumos (issue #2703) MFC after: 2 weeks
|
230514 |
24-Jan-2012 |
mm |
Merge illumos revisions 13572, 13573, 13574:
Rev. 13572: disk sync write perf regression when slog is used post oi_148 [1]
Rev. 13573: crash during reguid causes stale config [2] allow and unallow missing from zpool history since removal of pyzfs [5]
Rev. 13574: leaking a vdev when removing an l2cache device [3] memory leak when adding a file-based l2arc device [4] leak in ZFS from metaslab_group_create and zfs_ereport_checksum [6]
References: https://www.illumos.org/issues/1909 [1] https://www.illumos.org/issues/1949 [2] https://www.illumos.org/issues/1951 [3] https://www.illumos.org/issues/1952 [4] https://www.illumos.org/issues/1953 [5] https://www.illumos.org/issues/1954 [6]
Obtained from: illumos (issues #1909, #1949, #1951, #1952, #1953, #1954) MFC after: 2 weeks
|
230438 |
21-Jan-2012 |
pjd |
Dramatically optimize listing snapshots when user requests only snapshot names and wants to sort them by name, ie. when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot properties.
Below you can find how long does it take to list 34509 snapshots from a single disk pool before and after this change with cold and warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null cold cache: 525s warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null cold cache: 1.7s warm cache: 1.1s
MFC after: 1 week
|
228103 |
28-Nov-2011 |
mm |
Merge new ZFS features from illumos:
1644 add ZFS "clones" property https://www.illumos.org/issues/1644
1645 add ZFS "written" and "written@..." properties https://www.illumos.org/issues/1645
1646 "zfs send" should estimate size of stream https://www.illumos.org/issues/1646
1647 "zfs destroy" should determine space reclaimed by destroying multiple snapshots https://www.illumos.org/issues/1647
1693 persistent 'comment' field for a zpool https://www.illumos.org/issues/1693
1708 adjust size of zpool history data https://www.illumos.org/issues/1708
1748 desire support for reguid in zfs https://www.illumos.org/issues/1748
Obtained from: illumos (changesets 13514, 13524, 13525) MFC after: 1 month
|
226707 |
24-Oct-2011 |
pjd |
- Use better naming now that we allow to rename any mounted file system (not only legacy). - Update copyright to include myself.
MFC after: 2 weeks
|
226676 |
24-Oct-2011 |
pjd |
Allow to rename file systems without remounting if it is possible. It is possible for file systems with 'mountpoint' preperty set to 'legacy' or 'none' - we don't have to change mount directory for them. Currently such file systems are unmounted on rename and not even mounted back.
This introduces layering violation, as we need to update 'f_mntfromname' field in statfs structure related to mountpoint (for the dataset we are renaming and all its children).
In my opinion it is worth it, as it allow to update FreeBSD in even cleaner way - in ZFS-only configuration root file system is ZFS file system with 'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs), update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs and rename it back to system/rootfs while it is mounted as /. Before it was not possible, because unmounting / was not possible.
MFC after: 2 weeks
|
224791 |
12-Aug-2011 |
pjd |
Eliminate the zfsdev_state_lock entirely and replace it with the spa_namespace_lock. This fixes LOR between the spa_namespace_lock and spa_config lock. LOR can cause deadlock on vdevs removal/insertion.
Reported by: gibbs, delphij Tested by: delphij Approved by: re (kib) MFC after: 1 week
|
224251 |
21-Jul-2011 |
delphij |
A different implementation of r224231 proposed by pjd@, which does not require change in the znode structure. Specifically, it queries rdev from the znode in the same sa_bulk_lookup already done in zfs_getattr().
Submitted by: pjd (with some revisions) Reviewed by: pjd, mm Approved by: re (kib)
|
224231 |
20-Jul-2011 |
delphij |
Add a new field to in-core znode, z_rdev, to represent device nodes.
PR: kern/159010 Reviewed by: mm@ Approved by: re (kib) MFC after: 2 weeks
|
224177 |
18-Jul-2011 |
mm |
ZFS tries to allocate blocks evenly across all devices. This means when devices are imbalanced zfs will lots of CPU searching for space on devices which tend to be pretty full. It should instead fail quickly on the full devices and move onto devices which have more availability.
New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)
Illumos-gate changeset: 13379:4df42cc92254
Obtained from: Illumos (Bug #1051) MFC after: 2 weeks
|
224174 |
18-Jul-2011 |
mm |
Resurrect the ZFS "aclmode" property Change default of "aclmode" to "discard".
Illumos-gate changeset: 13370:8c04143bd318
Obtained from: Illumos (Feature #742) MFC after: 2 weeks
|
222950 |
10-Jun-2011 |
gibbs |
Remove C constructs that are incompatible with C++ from various OpenSolaris and ZFS header files. These changes are sufficient to allow a C++ program to use the libzfs library.
Note: The majority of these files already included 'extern "C"' declarations, so the intention of providing C++ compatibility already existed even if it wasn't provided.
cddl/compat/opensolaris/include/assert.h: Wrap our compatibility assert implementation in 'extern "C"'. Since this is a compatibility header I matched the Solaris style of doing this explicitly rather than rely on FreeBSD's __BEGIN/END_DECLS macro.
sys/cddl/compat/opensolaris/sys/kstat.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h: Rename parameters in function declarations that conflict with C++ keywords. This was the solution preferred by members of the Illumos community.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h: In C, nested structures are visible in the global namespace, but in C++, they take on the namespace of the structure in which they are contained. Flatten nested structure definitions within struct zfs_cmd so these structures are visible in the global namespace when compiled in both languages.
Sponsored by: Spectra Logic Corporation
|
222267 |
24-May-2011 |
pjd |
Don't access task structure once we call task function. The task structure might be no longer available. This also allows to eliminates the need for two tasks in the zio structure.
Submitted by: anonymous MFC after: 2 weeks
|
221263 |
30-Apr-2011 |
mm |
Fix deduplicated zfs receive (dmu_recv_stream builds incomplete guid_to_ds_map)
Illumos-gate changeset: 13329:c48b8bf84ab7 MFC together with v28
Approved by: pjd Obtained from: Illumos (Bug #755)
|
219320 |
06-Mar-2011 |
pjd |
Fix libzpool build.
MFC after: 1 month
|
219317 |
05-Mar-2011 |
pjd |
Make renaming of a ZVOL, ZVOL's parent directory and ZVOL snapshot work.
Reported by: avg MFC after: 1 month
|
219089 |
27-Feb-2011 |
pjd |
Finally... Import the latest open-source ZFS version - (SPA) 28.
Few new things available from now on:
- Data deduplication. - Triple parity RAIDZ (RAIDZ3). - zfs diff. - zpool split. - Snapshot holds. - zpool import -F. Allows to rewind corrupted pool to earlier transaction group. - Possibility to import pool in read-only mode.
MFC after: 1 month
|
216919 |
03-Jan-2011 |
mm |
MFp4 r186485, r186859:
Fix a race by defining two tasks in the zio structure as we can still be returning from issue task when interrupt task is used.
Tested by: pjd Approved by: pjd, delphij (mentor) MFC after: 3 days
|
213198 |
27-Sep-2010 |
mm |
Properly handle IO with B_FAILFAST Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted
OpenSolaris revision and Bug IDs:
9725:0bf7402e8022 6843014 ZFS B_FAILFAST handling is broken
Approved by: delphij (mentor) Obtained from: OpenSolaris (Bug ID 6843014) MFC after: 3 weeks
|
213197 |
27-Sep-2010 |
mm |
Enable offlining of log devices.
OpenSolaris revision and Bug IDs:
9701:cc5b64682e64 6803605 should be able to offline log devices 6726045 vdev_deflate_ratio is not set when offlining a log device 6599442 zpool import has faults in the display
Approved by: delphij (mentor) Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442) MFC after: 3 weeks
|
212694 |
15-Sep-2010 |
mm |
Fix kernel panic when moving a file to .zfs/shares Fix possible loss of correct error return code in ZFS mount
OpenSolaris revisions and Bug IDs:
11824:53128e5db7cf 6863610 ZFS mount can lose correct error return
12079:13822b941977 6939941 problem with moving files in zfs (142901-12)
Approved by: delphij (mentor) Obtained from: OpenSolaris (Bug ID 6863610, 6939941) MFC after: 3 days
|
211932 |
28-Aug-2010 |
mm |
Import changes from OpenSolaris that provide - better ACL caching and speedup of ACL permission checks - faster handling of stat() - lowered mutex contention in the read/writer lock (rrwlock) - several related bugfixes
Detailed information (OpenSolaris onnv changesets and Bug IDs):
9749:105f407a2680 6802734 Support for Access Based Enumeration (not used on FreeBSD) 6844861 inconsistent xattr readdir behavior with too-small buffer
9866:ddc5f1d8eb4e 6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership
9981:b4907297e740 6775100 stat() performance on files on zfs should be improved 6827779 rrwlock is overly protective of its counters
10143:d2d432dfe597 6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc 6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission
10232:f37b85f7e03e 6865875 zfs sometimes incorrectly giving search access to a dir
10250:b179ceb34b62 6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)
10269:2788675568fd 6868276 zfs_rezget() can be hazardous when znode has a cached ACL
10295:f7a18a1e9610 6870564 panic in zfs_getsecattr
Approved by: delphij (mentor) Obtained from: OpenSolaris (multiple Bug IDs) MFC after: 2 weeks
|
211931 |
28-Aug-2010 |
mm |
Update ZFS metaslab code from OpenSolaris. This provides a noticeable write speedup, especially on pools with less than 30% of free space.
Detailed information (OpenSolaris onnv changesets and Bug IDs):
11146:7e58f40bcb1c 6826241 Sync write IOPS drops dramatically during TXG sync 6869229 zfs should switch to shiny new metaslabs more frequently
11728:59fdb3b856f6 6918420 zdb -m has issues printing metaslab statistics
12047:7c1fcc8419ca 6917066 zfs block picking can be improved
Approved by: delphij (mentor) Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066) MFC after: 2 weeks
|
209962 |
13-Jul-2010 |
mm |
Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced in Solaris 10 updates 141445-09 and 142901-14.
Detailed information: (OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)
7844:effed23820ae 6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)
7897:e520d8258820 6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)
7965:b795da521357 6740164 zpool attach can create an illegal root pool (141909-02)
8084:b811cc60d650 6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)
8121:7fd09d4ebd9c 6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)
8129:e4f45a0bfbb0 6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)
8188:fd00c0a81e80 6761100 want zdb option to select older uberblocks (141445-01)
8190:6eeea43ced42 6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)
8225:59a9961c2aeb 6737463 panic while trying to write out config file if root pool import fails (141445-01)
8227:f7d7be9b1f56 6765294 Refactor replay (141445-01)
8228:51e9ca9ee3a5 6572357 libzfs should do more to avoid mnttab lookups (141909-01) 6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)
8241:5a60f16123ba 6328632 zpool offline is a bit too conservative (141445-01) 6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01) 6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01) 6747698 checksum failures after offline -t / export / import / scrub (141445-01) 6745863 ZFS writes to disk after it has been offlined (141445-01) 6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01) 6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01) 6758107 I/O should never suspend during spa_load() (141445-01) 6776548 codereview(1) runs off the page when faced with multi-line comments (N/A) 6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)
8242:e46e4b2f0a03 6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)
8269:03a7e9050cfd 6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01) 6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01) 6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01) 6595194 "zfs get" VALUE column is as wide as NAME (141445-01) 6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01) 6396518 ASSERT strings shouldn't be pre-processed (141445-01)
8274:846b39508aff 6713916 scrub/resilver needlessly decompress data (141445-01)
8343:655db2375fed 6739553 libzfs_status msgid table is out of sync (141445-01) 6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01) 6784108 zfs_realloc() should not free original memory on failure (141445-01)
8525:e0e0e525d0f8 6788830 set large value to reservation cause core dump (141445-01) 6791064 want sysevents for ZFS scrub (141445-01) 6791066 need to be able to set cachefile on faulted pools (141445-01) 6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01) 6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)
8547:bcc7b46e5ff7 6792884 Vista clients cannot access .zfs (141445-01)
8632:36ef517870a3 6798384 It can take a village to raise a zio (141445-01)
8636:7e4ce9158df3 6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01) 6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01) 6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01) 6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01) 6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)
8692:692d4668b40d 6801507 ZFS read aggregation should not mind the gap (141445-01)
8697:e62d2612c14d 6633095 creating a filesystem with many properties set is slow (141445-01)
8768:dfecfdbb27ed 6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)
8811:f8deccf701cf 6790687 libzfs mnttab caching ignores external changes (141445-01) 6791101 memory leak from libzfs_mnttab_init (141445-01)
8845:91af0d9c0790 6800942 smb_session_create() incorrectly stores IP addresses (N/A) 6582163 Access Control List (ACL) for shares (141445-01) 6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A) 6800184 Panic at smb_oplock_conflict+0x35() (N/A)
8876:59d2e67b4b65 6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)
8924:5af812f84759 6789318 coredump when issue zdb -uuuu poolname/ (141445-01) 6790345 zdb -dddd -e poolname coredump (141445-01) 6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01) 6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01) 6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)
9030:243fd360d81f 6815893 hang mounting a dataset after booting into a new boot environment (141445-01)
9056:826e1858a846 6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)
9179:d8fbd96b79b3 6790064 zfs needs to determine uid and gid earlier in create process (141445-01)
9214:8d350e5d04aa 6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01) 6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)
9229:e3f8b41e5db4 6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)
9230:e4561e3eb1ef 6821169 offlining a device results in checksum errors (141445-01) 6821170 ZFS should not increment error stats for unavailable devices (141445-01) 6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)
9234:bffdc4fc05c4 6792139 recovering from a suspended pool needs some work (141445-01) 6794830 reboot command hangs on a failed zfs pool (141445-01)
9246:67c03c93c071 6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)
9276:a8a7fc849933 6816124 System crash running zpool destroy on broken zpool (141445-03)
9355:09928982c591 6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)
9391:413d0661ef33 6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)
9396:f41cf682d0d3 (part already merged) 6501037 want user/group quotas on ZFS (141445-03) 6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03) 6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03) 6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)
9404:319573cd93f8 6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)
9412:4aefd8704ce0 6717022 ZFS DMU needs zero-copy support (141445-03)
9425:e7ffacaec3a8 6799895 spa_add_spares() needs to be protected by config lock (141445-03) 6826466 want to post sysevents on hot spare activation (141445-03) 6826468 spa 'allowfaulted' needs some work (141445-03) 6826469 kernel support for storing vdev FRU information (141445-03) 6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03) 6826471 I/O errors after device remove probe can confuse FMA (141445-03) 6826472 spares should enjoy some of the benefits of cache devices (141445-03)
9443:2a96d8478e95 6833711 gang leaders shouldn't have to be logical (141445-03)
9463:d0bd231c7518 6764124 want zdb to be able to checksum metadata blocks only (141445-03)
9465:8372081b8019 6830237 zfs panic in zfs_groupmember() (141445-03)
9466:1fdfd1fed9c4 6833162 phantom log device in zpool status (141445-03)
9469:4f68f041ddcd 6824968 add ZFS userquota support to rquotad (141445-03)
9470:6d827468d7b5 6834217 godfather I/O should reexecute (141445-03)
9480:fcff33da767f 6596237 Stop looking and start ganging (141909-02)
9493:9933d599bc93 6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)
9512:64cafcbcc337 6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)
9515:d3b739d9d043 6586537 async zio taskqs can block out userland commands (142901-09)
9554:787363635b6a 6836768 zfs_userspace() callback has no way to indicate failure (N/A)
9574:1eb6a6ab2c57 6838062 zfs panics when an error is encountered in space_map_load() (141909-02)
9583:b0696cd037cc 6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)
9630:e25a03f552e0 6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)
9653:a70048a304d1 6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)
9688:127be1845343 6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A) 6843069 zfs get userused@S-1-... doesn't work (N/A)
9873:8ddc892eca6e 6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)
9904:d260bd3fd47c 6838344 kernel heap corruption detected on zil while stress testing (141445-06)
9951:a4895b3dd543 6844900 zfs_ioc_userspace_upgrade leaks (N/A)
10040:38b25aeeaf7a 6857012 zfs panics on zpool import (141445-06)
10000:241a51d8720c 6848242 zdb -e no longer works as expected (N/A)
10100:4a6965f6bef8 6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)
10160:a45b03783d44 6861983 zfs should use new name <-> SID interfaces (N/A) 6862984 userquota commands can hang (141445-06)
10299:80845694147f 6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)
10302:a9e3d1987706 6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)
10575:2a8816c5173b (partial merge) 6882227 spa_async_remove() shouldn't do a full clear (142901-14)
10800:469478b180d9 6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09) 6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)
10801:e0bf032e8673 (partial merge) 6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)
10810:b6b161a6ae4a 6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)
10890:499786962772 6807339 spurious checksum errors when replacing a vdev (142901-13)
11249:6c30f7dfc97b 6906110 bad trap panic in zil_replay_log_record (142901-13) 6906946 zfs replay isn't handling uid/gid correctly (142901-13)
11454:6e69bacc1a5a 6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)
11546:42ea6be8961b (partial merge) 6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)
Discussed with: pjd Approved by: delphij (mentor) Obtained from: OpenSolaris (multiple Bug IDs) MFC after: 2 months
|
208373 |
21-May-2010 |
mm |
Update L2ARC code and fix several bugs.
- improve ARC memory consumption (Bug ID 6488341) - ARC/L2ARC metadata accounting (Bug ID 6748019) - L2ARC turbo warmup (Bud ID 6748023) - kstats for ARC content (Bug ID 6748023) - kstats for evicted bytes from ARC by L2ARC state (Bud ID 6871680) - fix panic on i386 systems (Bug ID 6821260)
OpenSolaris onnv revisions: 8582:df9361868dbe, 8628:97dcded6e556, 9215:7c4584f76b47, 9274:a10f8bd993c1, 10357:29060492b29d
OpenSolaris Bug IDs: 6748019, 6748023, 6748030, 6488341, 6798268, 6821260, 6790261, 6871680
Approved by: pjd, delphij (mentor) Obtained from: OpenSlaris (multiple bug IDs) MFC after: 3 days
|
208370 |
21-May-2010 |
mm |
Fix: vdev_reopen() can lead to failed allocations
OpenSolaris onnv-revision: 7980:589f37f25048
Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (Bug ID 6764914) MFC after: 3 days
|
208166 |
16-May-2010 |
pjd |
Fix userland build by making io_task available only for the kernel and by providing taskq_dispatch_safe() macro.
MFC after: 1 week
|
208147 |
16-May-2010 |
pjd |
Add task structure to zio and use it instead of allocating one. This eliminates the only place where we can sleep when calling zio_interrupt(). As a side-effect this can actually improve performance a little as we allocate one less thing for every I/O.
Prodded by: kib MFC after: 1 week
|
208131 |
16-May-2010 |
mm |
Fix deadlock between zfs_dirent_lock and zfs_rmdir
OpenSolaris onnv revision: 11321:506b7043a14c
Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (Bug ID 6847615) MFC after: 3 days
|
208130 |
16-May-2010 |
mm |
Fix perfomance problem with ZFS prefetch caching [1] Add statistics for ZFS prefetch (sysctl kstat.zfs.misc.zfetchstats)
Partial import of OpenSolaris onnv revision 10474:0e96dd3b905a
Reported by: jhell@dataix.net (private e-mail) [1] Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (Bug ID 6859997, 6868951) MFC after: 3 days
|
208047 |
13-May-2010 |
mm |
Import OpenSolaris revision 7837:001de5627df3 It includes the following changes: - parallel reads in traversal code (Bug ID 6333409) - faster traversal for zfs send (Bug ID 6418042) - traversal code cleanup (Bug ID 6725675) - fix for two scrub related bugs (Bug ID 6729696, 6730101) - fix assertion in dbuf_verify (Bug ID 6752226) - fix panic during zfs send with i/o errors (Bug ID 6577985) - replace P2CROSS with P2BOUNDARY (Bug ID 6725680)
List of OpenSolaris Bug IDs: 6333409, 6418042, 6757112, 6725668, 6725675, 6725680, 6725698, 6729696, 6730101, 6752226, 6577985, 6755042
Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (multiple Bug IDs) MFC after: 1 week
|
207670 |
05-May-2010 |
mm |
Introduce hardforce export option (-F) for "zpool export". When exporting with this flag, zpool.cache remains untouched.
OpenSolaris onnv revision: 8211:32722be6ad3b
Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (Bug ID: 6775357)
|
207626 |
04-May-2010 |
mm |
Speed up ZFS list operation with objset prefetching.
Partial import of OpenSolaris onnv revisions: 8415:8809e849f63e, 10474:0e96dd3b905a
PR: kern/146297 Submitted by: myself Approved by: pjd, delphij (mentor) Obtained from: OpenSolaris (Bug ID 6386929, 6755389, 6847118) MFC after: 2 weeks
|
206797 |
18-Apr-2010 |
pjd |
Restore previous order.
|
205231 |
16-Mar-2010 |
kmacy |
- reduce contention by breaking up ARC state locks in to 16 for data and 16 for metadata - export L2ARC tunables as sysctls - add several kstats to track L2ARC state more precisely - avoid holding a contended lock when atomically incrementing a contended counter (no lock protection needed for atomics)
|
201143 |
28-Dec-2009 |
delphij |
Apply OpenSolaris revision 8012 which brings our zpool to version 14, making it possible for zpools created on OpenSolaris 2009.06 be used on FreeBSD.
PR: kern/141800 Submitted by: mm Reviewed by: pjd, trasz Obtained from: OpenSolaris MFC after: 2 weeks
|
200726 |
19-Dec-2009 |
delphij |
Apply fix for Solaris bug 6801979: zfs recv can fail with E2BIG (onnv revision 8986)
Requested by: mm Submitted by: pjd Obtained from: OpenSolaris MFC after: 2 weeks
|
200724 |
19-Dec-2009 |
delphij |
Apply fix Solaris bug 6462803 zfs snapshot -r failed because filesystem was busy.
Submitted by: mm Approved by: pjd MFC after: 2 weeks
|
197497 |
25-Sep-2009 |
pjd |
Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to be a bit weak and OpenSolaris also switched to fletcher4.
PR: kern/139072 Reported by: Daniel Grund <bugs@dgrund.de> MFC after: 3 days
|
197459 |
24-Sep-2009 |
pjd |
Before calling vflush(FORCECLOSE) mark file system as unmounted so the following vnops will fail. This is very important, because without this change vnode could be reclaimed at any point, even if we increased usecount. The only way to ensure that vnode won't be reclaimed was to lock it, which would be very hard to do in ZFS without changing a lot of code. With this change simply increasing usecount is enough to be sure vnode won't be reclaimed from under us. To be precise it can still be reclaimed but we won't be able to see it, because every try to enter ZFS through VFS will result in EIO.
The only function that cannot return EIO, because it is needed for vflush() is zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks z_teardown_lock and never returns EIO.
MFC after: 3 days
|
197153 |
13-Sep-2009 |
pjd |
When zfs.ko is compiled with debug, make sure that znode and vnode point at each other.
MFC after: 3 days
|
196703 |
31-Aug-2009 |
pjd |
Backport the 'dirtying dbuf' panic fix from newer ZFS version.
Reported by: Thomas Backman <serenity@exscape.org> MFC after: 1 week
|
196307 |
17-Aug-2009 |
pjd |
Manage asynchronous vnode release just like Solaris.
Discussed with: kmacy Approved by: re (kib)
|
196295 |
17-Aug-2009 |
pjd |
Remove OpenSolaris taskq port (it performs very poorly in our kernel) and replace it with wrappers around our taskqueue(9). To make it possible implement taskqueue_member() function which returns 1 if the given thread was created by the given taskqueue.
Approved by: re (kib)
|
195909 |
27-Jul-2009 |
pjd |
We don't support ephemeral IDs in FreeBSD and without this fix ZFS can panic when in zfs_fuid_create_cred() when userid is negative. It is converted to unsigned value which makes IS_EPHEMERAL() macro to incorrectly report that this is ephemeral ID. The most reasonable solution for now is to always report that the given ID is not ephemeral.
PR: kern/132337 Submitted by: Matthew West <freebsd@r.zeeb.org> Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com> Approved by: re (kib) MFC after: 2 weeks
|
194043 |
11-Jun-2009 |
kmacy |
pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous differences with Opensolaris' ZFS
Sorry for the churn
|
193980 |
11-Jun-2009 |
kmacy |
check against prefetch_enable
|
192800 |
26-May-2009 |
trasz |
MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly about removing a few #ifdefs and providing compatibility wrappers and VOP implementations to get and set an ACL; ZFS does ACL enforcement all by itself.
Note that the VOPs are ifdefed out for now, so this change should be a no-op.
Reviewed by: pjd
|
185029 |
17-Nov-2008 |
pjd |
Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.
This bring huge amount of changes, I'll enumerate only user-visible changes:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows to use additional disks for cache. Huge performance improvements mostly for random read of mostly static content.
- slog
Allow to use additional disks for ZFS Intent Log to speed up operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored on ZFS file systems owned by him. Very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Before if write requested failed, system paniced. Now one can select from one of three failure modes: - panic - panic on write error - wait - wait for disk to reappear - continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Just quota and reservation properties, but don't count space consumed by children file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
|
179310 |
25-May-2008 |
pjd |
Fix namespace collision after src/sys/sys/file.h:1.78.
|
178243 |
16-Apr-2008 |
kib |
Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock.
Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode.
The implementation of the lf_purgelocks() is submitted by dfr.
Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks
|
174049 |
28-Nov-2007 |
jb |
* Check endianness the FreeBSD way.
* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global variable.
|
168676 |
12-Apr-2007 |
pjd |
MFp4: Synchronize with vendor (mostly 'zfs rename -r').
|
168566 |
10-Apr-2007 |
pjd |
Try to stabilize ZFS with regard to memory consumption: - Allow to shrink ARC down to 16MB (instead of 64MB). - Set arc_max to 1/2 of kmem_map by default. - Start freeing things earlier when low memory situation is detected. - Serialize execution of arc_lowmem().
I decided to setup minimum ZFS memory requirements to 512MB of RAM and 256MB of kmem_map size. If there is less RAM or kmem_map, a warning will be printed. World is cruel, be no better. In other words: modern file system requires modern hardware:)
From ZFS administration guide:
"Currently the minimum amount of memory recommended to install a Solaris system is 512 Mbytes. However, for good ZFS performance, at least one Gbyte or more of memory is recommended."
|
168498 |
08-Apr-2007 |
pjd |
MFp4: Synchronize with recent OpenSolaris changes.
|
168404 |
06-Apr-2007 |
pjd |
Please welcome ZFS - The last word in file systems.
ZFS file system was ported from OpenSolaris operating system. The code in under CDDL license.
I'd like to thank all SUN developers that created this great piece of software.
Supported by: Wheel LTD (http://www.wheel.pl/) Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/) Supported by: Sentex (http://www.sentex.net/)
|